linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fsl_espi errors on v5.7.15
@ 2020-08-14  2:48 Chris Packham
  2020-08-14  6:19 ` Heiner Kallweit
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-14  2:48 UTC (permalink / raw)
  To: broonie, mpe, benh, paulus, Heiner Kallweit, tiago.brusamarello
  Cc: linux-spi, linuxppc-dev, linux-kernel

Hi,

I'm seeing a problem with accessing spi-nor after upgrading a T2081 
based system to linux v5.7.15

For this board u-boot and the u-boot environment live on spi-nor.

When I use fw_setenv from userspace I get the following kernel logs

# fw_setenv foo=1
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
...

If I run fw_printenv (before getting it into a bad state) it is able to 
display the content of the boards u-boot environment.

If been unsuccessful in producing a setup for bisecting the issue. I do 
know the issue doesn't occur on the old 4.4.x based kernel but that's 
probably not much help.

Any pointers on what the issue (and/or solution) might be.

Thanks,
Chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-14  2:48 fsl_espi errors on v5.7.15 Chris Packham
@ 2020-08-14  6:19 ` Heiner Kallweit
  2020-08-16 21:09   ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Heiner Kallweit @ 2020-08-14  6:19 UTC (permalink / raw)
  To: Chris Packham, broonie, mpe, benh, paulus, tiago.brusamarello
  Cc: linux-spi, linuxppc-dev, linux-kernel

On 14.08.2020 04:48, Chris Packham wrote:
> Hi,
> 
> I'm seeing a problem with accessing spi-nor after upgrading a T2081 
> based system to linux v5.7.15
> 
> For this board u-boot and the u-boot environment live on spi-nor.
> 
> When I use fw_setenv from userspace I get the following kernel logs
> 
> # fw_setenv foo=1
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> ...
> 

This error reporting doesn't exist yet in 4.4. So you may have an issue
under 4.4 too, it's just not reported.
Did you verify that under 4.4 fw_setenv actually has an effect?

> If I run fw_printenv (before getting it into a bad state) it is able to 
> display the content of the boards u-boot environment.
> 

This might indicate an issue with spi being locked. I've seen related
questions, just use the search engine of your choice and check for
fw_setenv and locked.

> If been unsuccessful in producing a setup for bisecting the issue. I do 
> know the issue doesn't occur on the old 4.4.x based kernel but that's 
> probably not much help.
> 
> Any pointers on what the issue (and/or solution) might be.
> 
> Thanks,
> Chris
> 

Heiner

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-14  6:19 ` Heiner Kallweit
@ 2020-08-16 21:09   ` Chris Packham
  2020-08-18 22:44     ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-16 21:09 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus, tiago.brusamarello
  Cc: linux-spi, linuxppc-dev, linux-kernel


On 14/08/20 6:19 pm, Heiner Kallweit wrote:
> On 14.08.2020 04:48, Chris Packham wrote:
>> Hi,
>>
>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>> based system to linux v5.7.15
>>
>> For this board u-boot and the u-boot environment live on spi-nor.
>>
>> When I use fw_setenv from userspace I get the following kernel logs
>>
>> # fw_setenv foo=1
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> ...
>>
> This error reporting doesn't exist yet in 4.4. So you may have an issue
> under 4.4 too, it's just not reported.
> Did you verify that under 4.4 fw_setenv actually has an effect?
Just double checked and yes under 4.4 the setting does get saved.
>> If I run fw_printenv (before getting it into a bad state) it is able to
>> display the content of the boards u-boot environment.
>>
> This might indicate an issue with spi being locked. I've seen related
> questions, just use the search engine of your choice and check for
> fw_setenv and locked.
I'm running a version of fw_setenv which includes 
https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't 
be locking things unnecessarily.
>> If been unsuccessful in producing a setup for bisecting the issue. I do
>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>> probably not much help.
>>
>> Any pointers on what the issue (and/or solution) might be.
>>
>> Thanks,
>> Chris
>>
> Heiner

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-16 21:09   ` Chris Packham
@ 2020-08-18 22:44     ` Chris Packham
  2020-08-19  6:15       ` Heiner Kallweit
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-18 22:44 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus, tiago.brusamarello
  Cc: linux-spi, linuxppc-dev, linux-kernel

Hi Again,

On 17/08/20 9:09 am, Chris Packham wrote:

>
> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>> On 14.08.2020 04:48, Chris Packham wrote:
>>> Hi,
>>>
>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>> based system to linux v5.7.15
>>>
>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>
>>> When I use fw_setenv from userspace I get the following kernel logs
>>>
>>> # fw_setenv foo=1
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> ...
>>>
>> This error reporting doesn't exist yet in 4.4. So you may have an issue
>> under 4.4 too, it's just not reported.
>> Did you verify that under 4.4 fw_setenv actually has an effect?
> Just double checked and yes under 4.4 the setting does get saved.
>>> If I run fw_printenv (before getting it into a bad state) it is able to
>>> display the content of the boards u-boot environment.
>>>
>> This might indicate an issue with spi being locked. I've seen related
>> questions, just use the search engine of your choice and check for
>> fw_setenv and locked.
> I'm running a version of fw_setenv which includes 
> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't 
> be locking things unnecessarily.
>>> If been unsuccessful in producing a setup for bisecting the issue. I do
>>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>>> probably not much help.
>>>
>>> Any pointers on what the issue (and/or solution) might be.

I finally managed to get our board running with a vanilla kernel. With 
corenet64_smp_defconfig I occasionally see

   fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!

other than the message things seem to be working.

With a custom defconfig I see

   fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
   fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
   fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
   ...

and access to the spi-nor does not work until the board is reset.

I'll try and pick apart the differences between the two defconfigs.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-18 22:44     ` Chris Packham
@ 2020-08-19  6:15       ` Heiner Kallweit
  2020-08-19 21:08         ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Heiner Kallweit @ 2020-08-19  6:15 UTC (permalink / raw)
  To: Chris Packham, broonie, mpe, benh, paulus, tiago.brusamarello
  Cc: linux-spi, linuxppc-dev, linux-kernel

On 19.08.2020 00:44, Chris Packham wrote:
> Hi Again,
> 
> On 17/08/20 9:09 am, Chris Packham wrote:
> 
>>
>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>> Hi,
>>>>
>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>> based system to linux v5.7.15
>>>>
>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>
>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>
>>>> # fw_setenv foo=1
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> ...
>>>>
>>> This error reporting doesn't exist yet in 4.4. So you may have an issue
>>> under 4.4 too, it's just not reported.
>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>> Just double checked and yes under 4.4 the setting does get saved.
>>>> If I run fw_printenv (before getting it into a bad state) it is able to
>>>> display the content of the boards u-boot environment.
>>>>
>>> This might indicate an issue with spi being locked. I've seen related
>>> questions, just use the search engine of your choice and check for
>>> fw_setenv and locked.
>> I'm running a version of fw_setenv which includes 
>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't 
>> be locking things unnecessarily.
>>>> If been unsuccessful in producing a setup for bisecting the issue. I do
>>>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>>>> probably not much help.
>>>>
>>>> Any pointers on what the issue (and/or solution) might be.
> 
> I finally managed to get our board running with a vanilla kernel. With 
> corenet64_smp_defconfig I occasionally see
> 
>    fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> 
> other than the message things seem to be working.
> 
> With a custom defconfig I see
> 
>    fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>    fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>    fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>    ...
> 
> and access to the spi-nor does not work until the board is reset.
> 
> I'll try and pick apart the differences between the two defconfigs.
> 
Also relevant may be:
- Which dts are you using?
- What's the spi-nor type, and at which frequency are you operating it?
- Does the issue still happen if you lower the frequency?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-19  6:15       ` Heiner Kallweit
@ 2020-08-19 21:08         ` Chris Packham
  2020-08-24 22:04           ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-19 21:08 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus, tiago.brusamarello
  Cc: linux-spi, linuxppc-dev, linux-kernel


On 19/08/20 6:15 pm, Heiner Kallweit wrote:
> On 19.08.2020 00:44, Chris Packham wrote:
>> Hi Again,
>>
>> On 17/08/20 9:09 am, Chris Packham wrote:
>>
>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>> Hi,
>>>>>
>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>> based system to linux v5.7.15
>>>>>
>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>
>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>
>>>>> # fw_setenv foo=1
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> ...
>>>>>
>>>> This error reporting doesn't exist yet in 4.4. So you may have an issue
>>>> under 4.4 too, it's just not reported.
>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>> If I run fw_printenv (before getting it into a bad state) it is able to
>>>>> display the content of the boards u-boot environment.
>>>>>
>>>> This might indicate an issue with spi being locked. I've seen related
>>>> questions, just use the search engine of your choice and check for
>>>> fw_setenv and locked.
>>> I'm running a version of fw_setenv which includes
>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
>>> be locking things unnecessarily.
>>>>> If been unsuccessful in producing a setup for bisecting the issue. I do
>>>>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>>>>> probably not much help.
>>>>>
>>>>> Any pointers on what the issue (and/or solution) might be.
>> I finally managed to get our board running with a vanilla kernel. With
>> corenet64_smp_defconfig I occasionally see
>>
>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>
>> other than the message things seem to be working.
>>
>> With a custom defconfig I see
>>
>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>     ...
>>
>> and access to the spi-nor does not work until the board is reset.
>>
>> I'll try and pick apart the differences between the two defconfigs.

I now think my earlier testing is invalid. I have seen the problem with 
either defconfig if I try hard enough. I had convinced myself that the 
problem was CONFIG_PREEMPT but that was before I found boot-to-boot 
differences with the same kernel.

It's possible that I'm chasing multiple issues with the same symptom.

The error I'm most concerned with is in the sequence
1. boot with old image
2. write environment
3. boot with new image
4. write environment
5. write fails and environment is corrupted

After I recover the system things sometimes seem fine. Until I repeat 
the sequence above.

> Also relevant may be:
> - Which dts are you using?
Custom but based heavily on the t2080rdb.
> - What's the spi-nor type, and at which frequency are you operating it?
The board has several alternate parts for the spi-nor so the dts just 
specifies compatible = "jedec,spi-nor" the actual chip detected on the 
board I have is "n25q032a (4096 Kbytes)". The dts sets spi-max-frequency 
= <10000000> I haven't measured the actual frequency on the bus.
> - Does the issue still happen if you lower the frequency?
I did play around with the frequency initially but I should probably 
give that another go now that I have a better reproduction method.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-19 21:08         ` Chris Packham
@ 2020-08-24 22:04           ` Chris Packham
  2020-08-25  3:54             ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-24 22:04 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus, tiago.brusamarello
  Cc: linux-spi, linuxppc-dev, linux-kernel


On 20/08/20 9:08 am, Chris Packham wrote:
>
> On 19/08/20 6:15 pm, Heiner Kallweit wrote:
>> On 19.08.2020 00:44, Chris Packham wrote:
>>> Hi Again,
>>>
>>> On 17/08/20 9:09 am, Chris Packham wrote:
>>>
>>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>>> based system to linux v5.7.15
>>>>>>
>>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>>
>>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>>
>>>>>> # fw_setenv foo=1
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> ...
>>>>>>
>>>>> This error reporting doesn't exist yet in 4.4. So you may have an 
>>>>> issue
>>>>> under 4.4 too, it's just not reported.
>>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>>> If I run fw_printenv (before getting it into a bad state) it is 
>>>>>> able to
>>>>>> display the content of the boards u-boot environment.
>>>>>>
>>>>> This might indicate an issue with spi being locked. I've seen related
>>>>> questions, just use the search engine of your choice and check for
>>>>> fw_setenv and locked.
>>>> I'm running a version of fw_setenv which includes
>>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
>>>> be locking things unnecessarily.
>>>>>> If been unsuccessful in producing a setup for bisecting the 
>>>>>> issue. I do
>>>>>> know the issue doesn't occur on the old 4.4.x based kernel but 
>>>>>> that's
>>>>>> probably not much help.
>>>>>>
>>>>>> Any pointers on what the issue (and/or solution) might be.
>>> I finally managed to get our board running with a vanilla kernel. With
>>> corenet64_smp_defconfig I occasionally see
>>>
>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>
>>> other than the message things seem to be working.
>>>
>>> With a custom defconfig I see
>>>
>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>> empty!
>>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>     ...
>>>
>>> and access to the spi-nor does not work until the board is reset.
>>>
>>> I'll try and pick apart the differences between the two defconfigs.
>
> I now think my earlier testing is invalid. I have seen the problem 
> with either defconfig if I try hard enough. I had convinced myself 
> that the problem was CONFIG_PREEMPT but that was before I found 
> boot-to-boot differences with the same kernel.
>
> It's possible that I'm chasing multiple issues with the same symptom.
>
> The error I'm most concerned with is in the sequence
> 1. boot with old image
> 2. write environment
> 3. boot with new image
> 4. write environment
> 5. write fails and environment is corrupted
>
> After I recover the system things sometimes seem fine. Until I repeat 
> the sequence above.
>
>> Also relevant may be:
>> - Which dts are you using?
> Custom but based heavily on the t2080rdb.
>> - What's the spi-nor type, and at which frequency are you operating it?
> The board has several alternate parts for the spi-nor so the dts just 
> specifies compatible = "jedec,spi-nor" the actual chip detected on the 
> board I have is "n25q032a (4096 Kbytes)". The dts sets 
> spi-max-frequency = <10000000> I haven't measured the actual frequency 
> on the bus.
>> - Does the issue still happen if you lower the frequency?
> I did play around with the frequency initially but I should probably 
> give that another go now that I have a better reproduction method.

Playing around with the frequency didn't help.

One thing that I've found is that the problem appears to be that I end 
up with extra bytes in the RX FIFO. If I add code to drain the RX FIFO 
then the system is able to keep accessing the spi-nor (albeit with some 
noisy logs).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-24 22:04           ` Chris Packham
@ 2020-08-25  3:54             ` Chris Packham
  2020-08-25  7:22               ` Heiner Kallweit
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-25  3:54 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus
  Cc: linux-spi, linuxppc-dev, linux-kernel


On 25/08/20 10:04 am, Chris Packham wrote:
>
> On 20/08/20 9:08 am, Chris Packham wrote:
>>
>> On 19/08/20 6:15 pm, Heiner Kallweit wrote:
>>> On 19.08.2020 00:44, Chris Packham wrote:
>>>> Hi Again,
>>>>
>>>> On 17/08/20 9:09 am, Chris Packham wrote:
>>>>
>>>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>>>> based system to linux v5.7.15
>>>>>>>
>>>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>>>
>>>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>>>
>>>>>>> # fw_setenv foo=1
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>>>>> empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>>>>> empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>>>>> empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> ...
>>>>>>>
>>>>>> This error reporting doesn't exist yet in 4.4. So you may have an 
>>>>>> issue
>>>>>> under 4.4 too, it's just not reported.
>>>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>>>> If I run fw_printenv (before getting it into a bad state) it is 
>>>>>>> able to
>>>>>>> display the content of the boards u-boot environment.
>>>>>>>
>>>>>> This might indicate an issue with spi being locked. I've seen 
>>>>>> related
>>>>>> questions, just use the search engine of your choice and check for
>>>>>> fw_setenv and locked.
>>>>> I'm running a version of fw_setenv which includes
>>>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it 
>>>>> shouldn't
>>>>> be locking things unnecessarily.
>>>>>>> If been unsuccessful in producing a setup for bisecting the 
>>>>>>> issue. I do
>>>>>>> know the issue doesn't occur on the old 4.4.x based kernel but 
>>>>>>> that's
>>>>>>> probably not much help.
>>>>>>>
>>>>>>> Any pointers on what the issue (and/or solution) might be.
>>>> I finally managed to get our board running with a vanilla kernel. With
>>>> corenet64_smp_defconfig I occasionally see
>>>>
>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>
>>>> other than the message things seem to be working.
>>>>
>>>> With a custom defconfig I see
>>>>
>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>> empty!
>>>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>     ...
>>>>
>>>> and access to the spi-nor does not work until the board is reset.
>>>>
>>>> I'll try and pick apart the differences between the two defconfigs.
>>
>> I now think my earlier testing is invalid. I have seen the problem 
>> with either defconfig if I try hard enough. I had convinced myself 
>> that the problem was CONFIG_PREEMPT but that was before I found 
>> boot-to-boot differences with the same kernel.
>>
>> It's possible that I'm chasing multiple issues with the same symptom.
>>
>> The error I'm most concerned with is in the sequence
>> 1. boot with old image
>> 2. write environment
>> 3. boot with new image
>> 4. write environment
>> 5. write fails and environment is corrupted
>>
>> After I recover the system things sometimes seem fine. Until I repeat 
>> the sequence above.
>>
>>> Also relevant may be:
>>> - Which dts are you using?
>> Custom but based heavily on the t2080rdb.
>>> - What's the spi-nor type, and at which frequency are you operating it?
>> The board has several alternate parts for the spi-nor so the dts just 
>> specifies compatible = "jedec,spi-nor" the actual chip detected on 
>> the board I have is "n25q032a (4096 Kbytes)". The dts sets 
>> spi-max-frequency = <10000000> I haven't measured the actual 
>> frequency on the bus.
>>> - Does the issue still happen if you lower the frequency?
>> I did play around with the frequency initially but I should probably 
>> give that another go now that I have a better reproduction method.
>
> Playing around with the frequency didn't help.
>
> One thing that I've found is that the problem appears to be that I end 
> up with extra bytes in the RX FIFO. If I add code to drain the RX FIFO 
> then the system is able to keep accessing the spi-nor (albeit with 
> some noisy logs).

I've been staring at spi-fsl-espi.c for while now and I think I've 
identified a couple of deficiencies that may or may not be related to my 
issue.

First I think the 'Transfer done but SPIE_DON isn't set' message can be 
generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register. 
We also write back to it to clear the current events. We re-read it in 
fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can 
naturally end up in that situation if we're doing a large read. Consider 
the messages for reading a block of data from a spi-nor chip

  tx = READ_OP + ADDR
  rx = data

We setup the transfer and pump out the tx_buf. The first interrupt goes 
off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo, 
clear ESPI_SPIE and wait for the next interrupt. The next interrupt 
fires and this time we have ESPI_SPIE with just SPIM_RXT set. This 
continues until we've received all the data and we finish with ESPI_SPIE 
having only SPIM_RXT set. When we re-read it we complain that SPIE_DON 
isn't set.

The other deficiency is that we only get an interrupt when the amount of 
data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than 
FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.

I think the reason I'm seeing some variability is because of how fast 
(or slow) the interrupts get processed and how fast the spi-nor chip can 
fill the CPUs rx fifo.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-25  3:54             ` Chris Packham
@ 2020-08-25  7:22               ` Heiner Kallweit
  2020-08-25 22:22                 ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Heiner Kallweit @ 2020-08-25  7:22 UTC (permalink / raw)
  To: Chris Packham, broonie, mpe, benh, paulus
  Cc: linux-spi, linuxppc-dev, linux-kernel

On 25.08.2020 05:54, Chris Packham wrote:
> 
> On 25/08/20 10:04 am, Chris Packham wrote:
>>
>> On 20/08/20 9:08 am, Chris Packham wrote:
>>>
>>> On 19/08/20 6:15 pm, Heiner Kallweit wrote:
>>>> On 19.08.2020 00:44, Chris Packham wrote:
>>>>> Hi Again,
>>>>>
>>>>> On 17/08/20 9:09 am, Chris Packham wrote:
>>>>>
>>>>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>>>>> based system to linux v5.7.15
>>>>>>>>
>>>>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>>>>
>>>>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>>>>
>>>>>>>> # fw_setenv foo=1
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> ...
>>>>>>>>
>>>>>>> This error reporting doesn't exist yet in 4.4. So you may have an 
>>>>>>> issue
>>>>>>> under 4.4 too, it's just not reported.
>>>>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>>>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>>>>> If I run fw_printenv (before getting it into a bad state) it is 
>>>>>>>> able to
>>>>>>>> display the content of the boards u-boot environment.
>>>>>>>>
>>>>>>> This might indicate an issue with spi being locked. I've seen 
>>>>>>> related
>>>>>>> questions, just use the search engine of your choice and check for
>>>>>>> fw_setenv and locked.
>>>>>> I'm running a version of fw_setenv which includes
>>>>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it 
>>>>>> shouldn't
>>>>>> be locking things unnecessarily.
>>>>>>>> If been unsuccessful in producing a setup for bisecting the 
>>>>>>>> issue. I do
>>>>>>>> know the issue doesn't occur on the old 4.4.x based kernel but 
>>>>>>>> that's
>>>>>>>> probably not much help.
>>>>>>>>
>>>>>>>> Any pointers on what the issue (and/or solution) might be.
>>>>> I finally managed to get our board running with a vanilla kernel. With
>>>>> corenet64_smp_defconfig I occasionally see
>>>>>
>>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>
>>>>> other than the message things seem to be working.
>>>>>
>>>>> With a custom defconfig I see
>>>>>
>>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't 
>>>>> empty!
>>>>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>     ...
>>>>>
>>>>> and access to the spi-nor does not work until the board is reset.
>>>>>
>>>>> I'll try and pick apart the differences between the two defconfigs.
>>>
>>> I now think my earlier testing is invalid. I have seen the problem 
>>> with either defconfig if I try hard enough. I had convinced myself 
>>> that the problem was CONFIG_PREEMPT but that was before I found 
>>> boot-to-boot differences with the same kernel.
>>>
>>> It's possible that I'm chasing multiple issues with the same symptom.
>>>
>>> The error I'm most concerned with is in the sequence
>>> 1. boot with old image
>>> 2. write environment
>>> 3. boot with new image
>>> 4. write environment
>>> 5. write fails and environment is corrupted
>>>
>>> After I recover the system things sometimes seem fine. Until I repeat 
>>> the sequence above.
>>>
>>>> Also relevant may be:
>>>> - Which dts are you using?
>>> Custom but based heavily on the t2080rdb.
>>>> - What's the spi-nor type, and at which frequency are you operating it?
>>> The board has several alternate parts for the spi-nor so the dts just 
>>> specifies compatible = "jedec,spi-nor" the actual chip detected on 
>>> the board I have is "n25q032a (4096 Kbytes)". The dts sets 
>>> spi-max-frequency = <10000000> I haven't measured the actual 
>>> frequency on the bus.
>>>> - Does the issue still happen if you lower the frequency?
>>> I did play around with the frequency initially but I should probably 
>>> give that another go now that I have a better reproduction method.
>>
>> Playing around with the frequency didn't help.
>>
>> One thing that I've found is that the problem appears to be that I end 
>> up with extra bytes in the RX FIFO. If I add code to drain the RX FIFO 
>> then the system is able to keep accessing the spi-nor (albeit with 
>> some noisy logs).
> 
> I've been staring at spi-fsl-espi.c for while now and I think I've 
> identified a couple of deficiencies that may or may not be related to my 
> issue.
> 
> First I think the 'Transfer done but SPIE_DON isn't set' message can be 
> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register. 
> We also write back to it to clear the current events. We re-read it in 
> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can 
> naturally end up in that situation if we're doing a large read. Consider 
> the messages for reading a block of data from a spi-nor chip
> 
>   tx = READ_OP + ADDR
>   rx = data
> 
> We setup the transfer and pump out the tx_buf. The first interrupt goes 
> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo, 
> clear ESPI_SPIE and wait for the next interrupt. The next interrupt 
> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This 
> continues until we've received all the data and we finish with ESPI_SPIE 
> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON 
> isn't set.
> 
> The other deficiency is that we only get an interrupt when the amount of 
> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than 
> FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.
> 
SPIM_DON will trigger an interrupt once the last characters have been
transferred, and read the remaining characters from the FIFO.

> I think the reason I'm seeing some variability is because of how fast 
> (or slow) the interrupts get processed and how fast the spi-nor chip can 
> fill the CPUs rx fifo.
> 
To rule out timing issues at high bus frequencies I initially asked
for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
or even less, then timing shouldn't be an issue.

Last relevant functional changes have been done almost 4 years ago.
And yours is the first such report I see. So question is what could be so
special with your setup that it seems you're the only one being affected.
The scenarios you describe are standard, therefore much more people
should be affected in case of a driver bug.

You said that kernel config impacts how frequently the issue happens.
Therefore question is what's the diff in kernel config, and how could
the differences be related to SPI.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-25  7:22               ` Heiner Kallweit
@ 2020-08-25 22:22                 ` Chris Packham
  2020-08-26  1:48                   ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-25 22:22 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus
  Cc: linux-spi, linuxppc-dev, linux-kernel

On 25/08/20 7:22 pm, Heiner Kallweit wrote:

<snip>
> I've been staring at spi-fsl-espi.c for while now and I think I've
>> identified a couple of deficiencies that may or may not be related to my
>> issue.
>>
>> First I think the 'Transfer done but SPIE_DON isn't set' message can be
>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
>> We also write back to it to clear the current events. We re-read it in
>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>> naturally end up in that situation if we're doing a large read. Consider
>> the messages for reading a block of data from a spi-nor chip
>>
>>    tx = READ_OP + ADDR
>>    rx = data
>>
>> We setup the transfer and pump out the tx_buf. The first interrupt goes
>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>> continues until we've received all the data and we finish with ESPI_SPIE
>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>> isn't set.
>>
>> The other deficiency is that we only get an interrupt when the amount of
>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>> FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.
>>
> SPIM_DON will trigger an interrupt once the last characters have been
> transferred, and read the remaining characters from the FIFO.

The T2080RM that I have says the following about the DON bit

"Last character was transmitted. The last character was transmitted and 
a new command can be written for the next frame."

That does at least seem to fit with my assertion that it's all about the 
TX direction. But the fact that it doesn't happen all the time throws 
some doubt on it.

> I think the reason I'm seeing some variability is because of how fast
>> (or slow) the interrupts get processed and how fast the spi-nor chip can
>> fill the CPUs rx fifo.
>>
> To rule out timing issues at high bus frequencies I initially asked
> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
> or even less, then timing shouldn't be an issue.
Yes I've currently got spi-max-frequency = <1000000>; in my dts. I would 
also expect a slower frequency would fit my "DON is for TX" narrative.
> Last relevant functional changes have been done almost 4 years ago.
> And yours is the first such report I see. So question is what could be so
> special with your setup that it seems you're the only one being affected.
> The scenarios you describe are standard, therefore much more people
> should be affected in case of a driver bug.
Agreed. But even on my hardware (which may have a latent issue despite 
being in the field for going on 5 years) the issue only triggers under 
some fairly specific circumstances.
> You said that kernel config impacts how frequently the issue happens.
> Therefore question is what's the diff in kernel config, and how could
> the differences be related to SPI.

It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
impact but every time I found something that seemed to be having an 
impact I've been able to disprove it. I actually think its about how 
busy the system is which may or may not affect when we get round to 
processing the interrupts.

I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
occur on the T2080RDB.

I've had to add the following to expose the environment as a mtd partition

diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
index ff87e67c70da..fbf95fc1fd68 100644
--- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
@@ -116,6 +116,15 @@ flash@0 {
                                 compatible = "micron,n25q512ax3", 
"jedec,spi-nor";
                                 reg = <0>;
                                 spi-max-frequency = <10000000>; /* 
input clock */
+
+                               partition@u-boot {
+                                        reg = <0x00000000 0x00100000>;
+                                        label = "u-boot";
+                                };
+                                partition@u-boot-env {
+                                        reg = <0x00100000 0x00010000>;
+                                        label = "u-boot-env";
+                                };
                         };
                 };

And I'm using the following script to poke at the environment (warning 
if anyone does try this and the bug hits it can render your u-boot 
environment invalid).

cat flash/fw_env_test.sh
#!/bin/sh

generate_fw_env_config()
{
   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize name ; do
      echo "$dev $size $erasesize $name"
      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
$erasesize" >/flash/fw_env.config
   done
}

cycles=10
[ $# -ge 1 ] && cycles=$1

generate_fw_env_config

fw_printenv -c /flash/fw_env.config

dmesg -c >/dev/null
x=0
while [ $x -lt $cycles ]; do
     fw_printenv -c /flash/fw_env.config >/dev/null || break
     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
     dmesg -c | grep -q fsl_espi && break;
     let x=x+1
done

echo "Ran $x cycles"

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-25 22:22                 ` Chris Packham
@ 2020-08-26  1:48                   ` Chris Packham
  2020-08-26  6:07                     ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-26  1:48 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus
  Cc: linux-spi, linuxppc-dev, linux-kernel


On 26/08/20 10:22 am, Chris Packham wrote:
> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>
> <snip>
>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>> identified a couple of deficiencies that may or may not be related 
>>> to my
>>> issue.
>>>
>>> First I think the 'Transfer done but SPIE_DON isn't set' message can be
>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
>>> We also write back to it to clear the current events. We re-read it in
>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>> naturally end up in that situation if we're doing a large read. 
>>> Consider
>>> the messages for reading a block of data from a spi-nor chip
>>>
>>>    tx = READ_OP + ADDR
>>>    rx = data
>>>
>>> We setup the transfer and pump out the tx_buf. The first interrupt goes
>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>> continues until we've received all the data and we finish with 
>>> ESPI_SPIE
>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>> isn't set.
>>>
>>> The other deficiency is that we only get an interrupt when the 
>>> amount of
>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>> the fifo.
>>>
>> SPIM_DON will trigger an interrupt once the last characters have been
>> transferred, and read the remaining characters from the FIFO.
>
> The T2080RM that I have says the following about the DON bit
>
> "Last character was transmitted. The last character was transmitted 
> and a new command can be written for the next frame."
>
> That does at least seem to fit with my assertion that it's all about 
> the TX direction. But the fact that it doesn't happen all the time 
> throws some doubt on it.
>
>> I think the reason I'm seeing some variability is because of how fast
>>> (or slow) the interrupts get processed and how fast the spi-nor chip 
>>> can
>>> fill the CPUs rx fifo.
>>>
>> To rule out timing issues at high bus frequencies I initially asked
>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>> or even less, then timing shouldn't be an issue.
> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
> would also expect a slower frequency would fit my "DON is for TX" 
> narrative.
>> Last relevant functional changes have been done almost 4 years ago.
>> And yours is the first such report I see. So question is what could 
>> be so
>> special with your setup that it seems you're the only one being 
>> affected.
>> The scenarios you describe are standard, therefore much more people
>> should be affected in case of a driver bug.
> Agreed. But even on my hardware (which may have a latent issue despite 
> being in the field for going on 5 years) the issue only triggers under 
> some fairly specific circumstances.
>> You said that kernel config impacts how frequently the issue happens.
>> Therefore question is what's the diff in kernel config, and how could
>> the differences be related to SPI.
>
> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
> impact but every time I found something that seemed to be having an 
> impact I've been able to disprove it. I actually think its about how 
> busy the system is which may or may not affect when we get round to 
> processing the interrupts.
>
> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
> occur on the T2080RDB.
>
> I've had to add the following to expose the environment as a mtd 
> partition
>
> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> index ff87e67c70da..fbf95fc1fd68 100644
> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> @@ -116,6 +116,15 @@ flash@0 {
>                                 compatible = "micron,n25q512ax3", 
> "jedec,spi-nor";
>                                 reg = <0>;
>                                 spi-max-frequency = <10000000>; /* 
> input clock */
> +
> +                               partition@u-boot {
> +                                        reg = <0x00000000 0x00100000>;
> +                                        label = "u-boot";
> +                                };
> +                                partition@u-boot-env {
> +                                        reg = <0x00100000 0x00010000>;
> +                                        label = "u-boot-env";
> +                                };
>                         };
>                 };
>
> And I'm using the following script to poke at the environment (warning 
> if anyone does try this and the bug hits it can render your u-boot 
> environment invalid).
>
> cat flash/fw_env_test.sh
> #!/bin/sh
>
> generate_fw_env_config()
> {
>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize name 
> ; do
>      echo "$dev $size $erasesize $name"
>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
> $erasesize" >/flash/fw_env.config
>   done
> }
>
> cycles=10
> [ $# -ge 1 ] && cycles=$1
>
> generate_fw_env_config
>
> fw_printenv -c /flash/fw_env.config
>
> dmesg -c >/dev/null
> x=0
> while [ $x -lt $cycles ]; do
>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>     dmesg -c | grep -q fsl_espi && break;
>     let x=x+1
> done
>
> echo "Ran $x cycles"

I've also now seen the RX FIFO not empty error on the T2080RDB

fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

With my current workaround of emptying the RX FIFO. It seems survivable. 
Interestingly it only ever seems to be 1 extra byte in the RX FIFO and 
it seems to be after either a READ_SR or a READ_FSR.

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03

 From all the Micron SPI-NOR datasheets I've got access to it is 
possible to continually read the SR/FSR. But I've no idea why it happens 
some times and not others.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-26  1:48                   ` Chris Packham
@ 2020-08-26  6:07                     ` Chris Packham
  2020-08-26  6:38                       ` Heiner Kallweit
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-26  6:07 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus
  Cc: linux-spi, linuxppc-dev, linux-kernel


On 26/08/20 1:48 pm, Chris Packham wrote:
>
> On 26/08/20 10:22 am, Chris Packham wrote:
>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>
>> <snip>
>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>> identified a couple of deficiencies that may or may not be related 
>>>> to my
>>>> issue.
>>>>
>>>> First I think the 'Transfer done but SPIE_DON isn't set' message 
>>>> can be
>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE 
>>>> register.
>>>> We also write back to it to clear the current events. We re-read it in
>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>> naturally end up in that situation if we're doing a large read. 
>>>> Consider
>>>> the messages for reading a block of data from a spi-nor chip
>>>>
>>>>    tx = READ_OP + ADDR
>>>>    rx = data
>>>>
>>>> We setup the transfer and pump out the tx_buf. The first interrupt 
>>>> goes
>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>> continues until we've received all the data and we finish with 
>>>> ESPI_SPIE
>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>> isn't set.
>>>>
>>>> The other deficiency is that we only get an interrupt when the 
>>>> amount of
>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>>> the fifo.
>>>>
>>> SPIM_DON will trigger an interrupt once the last characters have been
>>> transferred, and read the remaining characters from the FIFO.
>>
>> The T2080RM that I have says the following about the DON bit
>>
>> "Last character was transmitted. The last character was transmitted 
>> and a new command can be written for the next frame."
>>
>> That does at least seem to fit with my assertion that it's all about 
>> the TX direction. But the fact that it doesn't happen all the time 
>> throws some doubt on it.
>>
>>> I think the reason I'm seeing some variability is because of how fast
>>>> (or slow) the interrupts get processed and how fast the spi-nor 
>>>> chip can
>>>> fill the CPUs rx fifo.
>>>>
>>> To rule out timing issues at high bus frequencies I initially asked
>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>> or even less, then timing shouldn't be an issue.
>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
>> would also expect a slower frequency would fit my "DON is for TX" 
>> narrative.
>>> Last relevant functional changes have been done almost 4 years ago.
>>> And yours is the first such report I see. So question is what could 
>>> be so
>>> special with your setup that it seems you're the only one being 
>>> affected.
>>> The scenarios you describe are standard, therefore much more people
>>> should be affected in case of a driver bug.
>> Agreed. But even on my hardware (which may have a latent issue 
>> despite being in the field for going on 5 years) the issue only 
>> triggers under some fairly specific circumstances.
>>> You said that kernel config impacts how frequently the issue happens.
>>> Therefore question is what's the diff in kernel config, and how could
>>> the differences be related to SPI.
>>
>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
>> impact but every time I found something that seemed to be having an 
>> impact I've been able to disprove it. I actually think its about how 
>> busy the system is which may or may not affect when we get round to 
>> processing the interrupts.
>>
>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
>> occur on the T2080RDB.
>>
>> I've had to add the following to expose the environment as a mtd 
>> partition
>>
>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> index ff87e67c70da..fbf95fc1fd68 100644
>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> @@ -116,6 +116,15 @@ flash@0 {
>>                                 compatible = "micron,n25q512ax3", 
>> "jedec,spi-nor";
>>                                 reg = <0>;
>>                                 spi-max-frequency = <10000000>; /* 
>> input clock */
>> +
>> +                               partition@u-boot {
>> +                                        reg = <0x00000000 0x00100000>;
>> +                                        label = "u-boot";
>> +                                };
>> +                                partition@u-boot-env {
>> +                                        reg = <0x00100000 0x00010000>;
>> +                                        label = "u-boot-env";
>> +                                };
>>                         };
>>                 };
>>
>> And I'm using the following script to poke at the environment 
>> (warning if anyone does try this and the bug hits it can render your 
>> u-boot environment invalid).
>>
>> cat flash/fw_env_test.sh
>> #!/bin/sh
>>
>> generate_fw_env_config()
>> {
>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize 
>> name ; do
>>      echo "$dev $size $erasesize $name"
>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
>> $erasesize" >/flash/fw_env.config
>>   done
>> }
>>
>> cycles=10
>> [ $# -ge 1 ] && cycles=$1
>>
>> generate_fw_env_config
>>
>> fw_printenv -c /flash/fw_env.config
>>
>> dmesg -c >/dev/null
>> x=0
>> while [ $x -lt $cycles ]; do
>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>     dmesg -c | grep -q fsl_espi && break;
>>     let x=x+1
>> done
>>
>> echo "Ran $x cycles"
>
> I've also now seen the RX FIFO not empty error on the T2080RDB
>
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> With my current workaround of emptying the RX FIFO. It seems 
> survivable. Interestingly it only ever seems to be 1 extra byte in the 
> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>
> fsl_espi ffe110000.spi: tx 70
> fsl_espi ffe110000.spi: rx 03
> fsl_espi ffe110000.spi: Extra RX 00
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
>
> From all the Micron SPI-NOR datasheets I've got access to it is 
> possible to continually read the SR/FSR. But I've no idea why it 
> happens some times and not others.

So I think I've got a reproduction and I think I've bisected the problem 
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in 
C"). My day is just finishing now so I haven't applied too much scrutiny 
to this result. Given the various rabbit holes I've been down on this 
issue already I'd take this information with a good degree of skepticism.

Thanks,
Chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-26  6:07                     ` Chris Packham
@ 2020-08-26  6:38                       ` Heiner Kallweit
  2020-08-26 22:26                         ` Chris Packham
  2020-08-27  7:12                         ` Nicholas Piggin
  0 siblings, 2 replies; 32+ messages in thread
From: Heiner Kallweit @ 2020-08-26  6:38 UTC (permalink / raw)
  To: Chris Packham, broonie, mpe, benh, paulus
  Cc: linux-spi, linuxppc-dev, linux-kernel

On 26.08.2020 08:07, Chris Packham wrote:
> 
> On 26/08/20 1:48 pm, Chris Packham wrote:
>>
>> On 26/08/20 10:22 am, Chris Packham wrote:
>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>
>>> <snip>
>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>> identified a couple of deficiencies that may or may not be related 
>>>>> to my
>>>>> issue.
>>>>>
>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message 
>>>>> can be
>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE 
>>>>> register.
>>>>> We also write back to it to clear the current events. We re-read it in
>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>> naturally end up in that situation if we're doing a large read. 
>>>>> Consider
>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>
>>>>>    tx = READ_OP + ADDR
>>>>>    rx = data
>>>>>
>>>>> We setup the transfer and pump out the tx_buf. The first interrupt 
>>>>> goes
>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>> continues until we've received all the data and we finish with 
>>>>> ESPI_SPIE
>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>> isn't set.
>>>>>
>>>>> The other deficiency is that we only get an interrupt when the 
>>>>> amount of
>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>>>> the fifo.
>>>>>
>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>> transferred, and read the remaining characters from the FIFO.
>>>
>>> The T2080RM that I have says the following about the DON bit
>>>
>>> "Last character was transmitted. The last character was transmitted 
>>> and a new command can be written for the next frame."
>>>
>>> That does at least seem to fit with my assertion that it's all about 
>>> the TX direction. But the fact that it doesn't happen all the time 
>>> throws some doubt on it.
>>>
>>>> I think the reason I'm seeing some variability is because of how fast
>>>>> (or slow) the interrupts get processed and how fast the spi-nor 
>>>>> chip can
>>>>> fill the CPUs rx fifo.
>>>>>
>>>> To rule out timing issues at high bus frequencies I initially asked
>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>> or even less, then timing shouldn't be an issue.
>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
>>> would also expect a slower frequency would fit my "DON is for TX" 
>>> narrative.
>>>> Last relevant functional changes have been done almost 4 years ago.
>>>> And yours is the first such report I see. So question is what could 
>>>> be so
>>>> special with your setup that it seems you're the only one being 
>>>> affected.
>>>> The scenarios you describe are standard, therefore much more people
>>>> should be affected in case of a driver bug.
>>> Agreed. But even on my hardware (which may have a latent issue 
>>> despite being in the field for going on 5 years) the issue only 
>>> triggers under some fairly specific circumstances.
>>>> You said that kernel config impacts how frequently the issue happens.
>>>> Therefore question is what's the diff in kernel config, and how could
>>>> the differences be related to SPI.
>>>
>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
>>> impact but every time I found something that seemed to be having an 
>>> impact I've been able to disprove it. I actually think its about how 
>>> busy the system is which may or may not affect when we get round to 
>>> processing the interrupts.
>>>
>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
>>> occur on the T2080RDB.
>>>
>>> I've had to add the following to expose the environment as a mtd 
>>> partition
>>>
>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> index ff87e67c70da..fbf95fc1fd68 100644
>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> @@ -116,6 +116,15 @@ flash@0 {
>>>                                 compatible = "micron,n25q512ax3", 
>>> "jedec,spi-nor";
>>>                                 reg = <0>;
>>>                                 spi-max-frequency = <10000000>; /* 
>>> input clock */
>>> +
>>> +                               partition@u-boot {
>>> +                                        reg = <0x00000000 0x00100000>;
>>> +                                        label = "u-boot";
>>> +                                };
>>> +                                partition@u-boot-env {
>>> +                                        reg = <0x00100000 0x00010000>;
>>> +                                        label = "u-boot-env";
>>> +                                };
>>>                         };
>>>                 };
>>>
>>> And I'm using the following script to poke at the environment 
>>> (warning if anyone does try this and the bug hits it can render your 
>>> u-boot environment invalid).
>>>
>>> cat flash/fw_env_test.sh
>>> #!/bin/sh
>>>
>>> generate_fw_env_config()
>>> {
>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize 
>>> name ; do
>>>      echo "$dev $size $erasesize $name"
>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
>>> $erasesize" >/flash/fw_env.config
>>>   done
>>> }
>>>
>>> cycles=10
>>> [ $# -ge 1 ] && cycles=$1
>>>
>>> generate_fw_env_config
>>>
>>> fw_printenv -c /flash/fw_env.config
>>>
>>> dmesg -c >/dev/null
>>> x=0
>>> while [ $x -lt $cycles ]; do
>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>     dmesg -c | grep -q fsl_espi && break;
>>>     let x=x+1
>>> done
>>>
>>> echo "Ran $x cycles"
>>
>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>
>> With my current workaround of emptying the RX FIFO. It seems 
>> survivable. Interestingly it only ever seems to be 1 extra byte in the 
>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>
>> fsl_espi ffe110000.spi: tx 70
>> fsl_espi ffe110000.spi: rx 03
>> fsl_espi ffe110000.spi: Extra RX 00
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>>
>> From all the Micron SPI-NOR datasheets I've got access to it is 
>> possible to continually read the SR/FSR. But I've no idea why it 
>> happens some times and not others.
> 
> So I think I've got a reproduction and I think I've bisected the problem 
> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in 
> C"). My day is just finishing now so I haven't applied too much scrutiny 
> to this result. Given the various rabbit holes I've been down on this 
> issue already I'd take this information with a good degree of skepticism.
> 
OK, so an easy test should be to re-test with a 5.4 kernel.
It doesn't have yet the change you're referring to, and the fsl-espi driver
is basically the same as in 5.7 (just two small changes in 5.7).

> Thanks,
> Chris
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-26  6:38                       ` Heiner Kallweit
@ 2020-08-26 22:26                         ` Chris Packham
  2020-08-27  7:12                         ` Nicholas Piggin
  1 sibling, 0 replies; 32+ messages in thread
From: Chris Packham @ 2020-08-26 22:26 UTC (permalink / raw)
  To: Heiner Kallweit, broonie, mpe, benh, paulus, Nicholas Piggin
  Cc: linux-spi, linuxppc-dev, linux-kernel

(adding Nicholas)

On 26/08/20 6:38 pm, Heiner Kallweit wrote:
> On 26.08.2020 08:07, Chris Packham wrote:
>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>
>>>> <snip>
>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>> to my
>>>>>> issue.
>>>>>>
>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>> can be
>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>> register.
>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>> Consider
>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>
>>>>>>     tx = READ_OP + ADDR
>>>>>>     rx = data
>>>>>>
>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>> goes
>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>> continues until we've received all the data and we finish with
>>>>>> ESPI_SPIE
>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>> isn't set.
>>>>>>
>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>> amount of
>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>> the fifo.
>>>>>>
>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>> transferred, and read the remaining characters from the FIFO.
>>>> The T2080RM that I have says the following about the DON bit
>>>>
>>>> "Last character was transmitted. The last character was transmitted
>>>> and a new command can be written for the next frame."
>>>>
>>>> That does at least seem to fit with my assertion that it's all about
>>>> the TX direction. But the fact that it doesn't happen all the time
>>>> throws some doubt on it.
>>>>
>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>> chip can
>>>>>> fill the CPUs rx fifo.
>>>>>>
>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>> or even less, then timing shouldn't be an issue.
>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>> narrative.
>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>> And yours is the first such report I see. So question is what could
>>>>> be so
>>>>> special with your setup that it seems you're the only one being
>>>>> affected.
>>>>> The scenarios you describe are standard, therefore much more people
>>>>> should be affected in case of a driver bug.
>>>> Agreed. But even on my hardware (which may have a latent issue
>>>> despite being in the field for going on 5 years) the issue only
>>>> triggers under some fairly specific circumstances.
>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>> the differences be related to SPI.
>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>> impact but every time I found something that seemed to be having an
>>>> impact I've been able to disprove it. I actually think its about how
>>>> busy the system is which may or may not affect when we get round to
>>>> processing the interrupts.
>>>>
>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>> occur on the T2080RDB.
>>>>
>>>> I've had to add the following to expose the environment as a mtd
>>>> partition
>>>>
>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>                                  compatible = "micron,n25q512ax3",
>>>> "jedec,spi-nor";
>>>>                                  reg = <0>;
>>>>                                  spi-max-frequency = <10000000>; /*
>>>> input clock */
>>>> +
>>>> +                               partition@u-boot {
>>>> +                                        reg = <0x00000000 0x00100000>;
>>>> +                                        label = "u-boot";
>>>> +                                };
>>>> +                                partition@u-boot-env {
>>>> +                                        reg = <0x00100000 0x00010000>;
>>>> +                                        label = "u-boot-env";
>>>> +                                };
>>>>                          };
>>>>                  };
>>>>
>>>> And I'm using the following script to poke at the environment
>>>> (warning if anyone does try this and the bug hits it can render your
>>>> u-boot environment invalid).
>>>>
>>>> cat flash/fw_env_test.sh
>>>> #!/bin/sh
>>>>
>>>> generate_fw_env_config()
>>>> {
>>>>    cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>> name ; do
>>>>       echo "$dev $size $erasesize $name"
>>>>       [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>> $erasesize" >/flash/fw_env.config
>>>>    done
>>>> }
>>>>
>>>> cycles=10
>>>> [ $# -ge 1 ] && cycles=$1
>>>>
>>>> generate_fw_env_config
>>>>
>>>> fw_printenv -c /flash/fw_env.config
>>>>
>>>> dmesg -c >/dev/null
>>>> x=0
>>>> while [ $x -lt $cycles ]; do
>>>>      fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>      fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>      dmesg -c | grep -q fsl_espi && break;
>>>>      let x=x+1
>>>> done
>>>>
>>>> echo "Ran $x cycles"
>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>
>>> With my current workaround of emptying the RX FIFO. It seems
>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>>  From all the Micron SPI-NOR datasheets I've got access to it is
>>> possible to continually read the SR/FSR. But I've no idea why it
>>> happens some times and not others.
>> So I think I've got a reproduction and I think I've bisected the problem
>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>> C"). My day is just finishing now so I haven't applied too much scrutiny
>> to this result. Given the various rabbit holes I've been down on this
>> issue already I'd take this information with a good degree of skepticism.
>>
> OK, so an easy test should be to re-test with a 5.4 kernel.
> It doesn't have yet the change you're referring to, and the fsl-espi driver
> is basically the same as in 5.7 (just two small changes in 5.7).

Cherry-picking 3282a3da25bd onto 5.4.x I'm able to reproduce the problem 
on the T2080RDB (if anyone want's it I've got a slightly different 
version of my test above that doesn't require touching the dts).

I'm not really sure how 3282a3da25bd affects this. It could be exposing 
a latent issue, the fsl-espi interface does seem a bit sensitive to the 
timing of events.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-26  6:38                       ` Heiner Kallweit
  2020-08-26 22:26                         ` Chris Packham
@ 2020-08-27  7:12                         ` Nicholas Piggin
  2020-08-27 22:07                           ` Chris Packham
  1 sibling, 1 reply; 32+ messages in thread
From: Nicholas Piggin @ 2020-08-27  7:12 UTC (permalink / raw)
  To: benh, broonie, Chris Packham, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
> On 26.08.2020 08:07, Chris Packham wrote:
>> 
>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>
>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>
>>>> <snip>
>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>> identified a couple of deficiencies that may or may not be related 
>>>>>> to my
>>>>>> issue.
>>>>>>
>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message 
>>>>>> can be
>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE 
>>>>>> register.
>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>> naturally end up in that situation if we're doing a large read. 
>>>>>> Consider
>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>
>>>>>>    tx = READ_OP + ADDR
>>>>>>    rx = data
>>>>>>
>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt 
>>>>>> goes
>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>> continues until we've received all the data and we finish with 
>>>>>> ESPI_SPIE
>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>> isn't set.
>>>>>>
>>>>>> The other deficiency is that we only get an interrupt when the 
>>>>>> amount of
>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>>>>> the fifo.
>>>>>>
>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>> transferred, and read the remaining characters from the FIFO.
>>>>
>>>> The T2080RM that I have says the following about the DON bit
>>>>
>>>> "Last character was transmitted. The last character was transmitted 
>>>> and a new command can be written for the next frame."
>>>>
>>>> That does at least seem to fit with my assertion that it's all about 
>>>> the TX direction. But the fact that it doesn't happen all the time 
>>>> throws some doubt on it.
>>>>
>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>> (or slow) the interrupts get processed and how fast the spi-nor 
>>>>>> chip can
>>>>>> fill the CPUs rx fifo.
>>>>>>
>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>> or even less, then timing shouldn't be an issue.
>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
>>>> would also expect a slower frequency would fit my "DON is for TX" 
>>>> narrative.
>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>> And yours is the first such report I see. So question is what could 
>>>>> be so
>>>>> special with your setup that it seems you're the only one being 
>>>>> affected.
>>>>> The scenarios you describe are standard, therefore much more people
>>>>> should be affected in case of a driver bug.
>>>> Agreed. But even on my hardware (which may have a latent issue 
>>>> despite being in the field for going on 5 years) the issue only 
>>>> triggers under some fairly specific circumstances.
>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>> the differences be related to SPI.
>>>>
>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
>>>> impact but every time I found something that seemed to be having an 
>>>> impact I've been able to disprove it. I actually think its about how 
>>>> busy the system is which may or may not affect when we get round to 
>>>> processing the interrupts.
>>>>
>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
>>>> occur on the T2080RDB.
>>>>
>>>> I've had to add the following to expose the environment as a mtd 
>>>> partition
>>>>
>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>                                 compatible = "micron,n25q512ax3", 
>>>> "jedec,spi-nor";
>>>>                                 reg = <0>;
>>>>                                 spi-max-frequency = <10000000>; /* 
>>>> input clock */
>>>> +
>>>> +                               partition@u-boot {
>>>> +                                        reg = <0x00000000 0x00100000>;
>>>> +                                        label = "u-boot";
>>>> +                                };
>>>> +                                partition@u-boot-env {
>>>> +                                        reg = <0x00100000 0x00010000>;
>>>> +                                        label = "u-boot-env";
>>>> +                                };
>>>>                         };
>>>>                 };
>>>>
>>>> And I'm using the following script to poke at the environment 
>>>> (warning if anyone does try this and the bug hits it can render your 
>>>> u-boot environment invalid).
>>>>
>>>> cat flash/fw_env_test.sh
>>>> #!/bin/sh
>>>>
>>>> generate_fw_env_config()
>>>> {
>>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize 
>>>> name ; do
>>>>      echo "$dev $size $erasesize $name"
>>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
>>>> $erasesize" >/flash/fw_env.config
>>>>   done
>>>> }
>>>>
>>>> cycles=10
>>>> [ $# -ge 1 ] && cycles=$1
>>>>
>>>> generate_fw_env_config
>>>>
>>>> fw_printenv -c /flash/fw_env.config
>>>>
>>>> dmesg -c >/dev/null
>>>> x=0
>>>> while [ $x -lt $cycles ]; do
>>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>     dmesg -c | grep -q fsl_espi && break;
>>>>     let x=x+1
>>>> done
>>>>
>>>> echo "Ran $x cycles"
>>>
>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>
>>> With my current workaround of emptying the RX FIFO. It seems 
>>> survivable. Interestingly it only ever seems to be 1 extra byte in the 
>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>> From all the Micron SPI-NOR datasheets I've got access to it is 
>>> possible to continually read the SR/FSR. But I've no idea why it 
>>> happens some times and not others.
>> 
>> So I think I've got a reproduction and I think I've bisected the problem 
>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in 
>> C"). My day is just finishing now so I haven't applied too much scrutiny 
>> to this result. Given the various rabbit holes I've been down on this 
>> issue already I'd take this information with a good degree of skepticism.
>> 
> OK, so an easy test should be to re-test with a 5.4 kernel.
> It doesn't have yet the change you're referring to, and the fsl-espi driver
> is basically the same as in 5.7 (just two small changes in 5.7).

There's 6cc0c16d82f88 and maybe also other interrupt related patches 
around this time that could affect book E, so it's good if that exact
patch is confirmed.

I've been staring at 3282a3da25bd for a while and nothing immediately 
stands out. It doesn't look like the low level handlers do anything
special (well 0x900 does ack the decrementer, but so does the masked
handler).

Can you try this patch and also enable CONFIG_PPC_IRQ_SOFT_MASK_DEBUG?

Thanks,
Nick

---
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index bf21ebd36190..10d339042330 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -214,7 +214,7 @@ void replay_soft_interrupts(void)
 	struct pt_regs regs;
 
 	ppc_save_regs(&regs);
-	regs.softe = IRQS_ALL_DISABLED;
+	regs.softe = IRQS_ENABLED;
 
 again:
 	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
@@ -349,6 +349,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
 		if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
 			WARN_ON_ONCE(!(mfmsr() & MSR_EE));
 		__hard_irq_disable();
+		local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
 	} else {
 		/*
 		 * We should already be hard disabled here. We had bugs
@@ -368,6 +369,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
 		}
 	}
 
+	preempt_disable();
 	irq_soft_mask_set(IRQS_ALL_DISABLED);
 	trace_hardirqs_off();
 
@@ -377,6 +379,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
 	trace_hardirqs_on();
 	irq_soft_mask_set(IRQS_ENABLED);
 	__hard_irq_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(arch_local_irq_restore);
 

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-27  7:12                         ` Nicholas Piggin
@ 2020-08-27 22:07                           ` Chris Packham
  2020-08-30 12:30                             ` Nicholas Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-27 22:07 UTC (permalink / raw)
  To: Nicholas Piggin, benh, broonie, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

On 27/08/20 7:12 pm, Nicholas Piggin wrote:
> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
>> On 26.08.2020 08:07, Chris Packham wrote:
>>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>>
>>>>> <snip>
>>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>>> to my
>>>>>>> issue.
>>>>>>>
>>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>>> can be
>>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>>> register.
>>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>>> Consider
>>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>>
>>>>>>>     tx = READ_OP + ADDR
>>>>>>>     rx = data
>>>>>>>
>>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>>> goes
>>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>>> continues until we've received all the data and we finish with
>>>>>>> ESPI_SPIE
>>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>>> isn't set.
>>>>>>>
>>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>>> amount of
>>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>>> the fifo.
>>>>>>>
>>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>>> transferred, and read the remaining characters from the FIFO.
>>>>> The T2080RM that I have says the following about the DON bit
>>>>>
>>>>> "Last character was transmitted. The last character was transmitted
>>>>> and a new command can be written for the next frame."
>>>>>
>>>>> That does at least seem to fit with my assertion that it's all about
>>>>> the TX direction. But the fact that it doesn't happen all the time
>>>>> throws some doubt on it.
>>>>>
>>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>>> chip can
>>>>>>> fill the CPUs rx fifo.
>>>>>>>
>>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>>> or even less, then timing shouldn't be an issue.
>>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>>> narrative.
>>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>>> And yours is the first such report I see. So question is what could
>>>>>> be so
>>>>>> special with your setup that it seems you're the only one being
>>>>>> affected.
>>>>>> The scenarios you describe are standard, therefore much more people
>>>>>> should be affected in case of a driver bug.
>>>>> Agreed. But even on my hardware (which may have a latent issue
>>>>> despite being in the field for going on 5 years) the issue only
>>>>> triggers under some fairly specific circumstances.
>>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>>> the differences be related to SPI.
>>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>>> impact but every time I found something that seemed to be having an
>>>>> impact I've been able to disprove it. I actually think its about how
>>>>> busy the system is which may or may not affect when we get round to
>>>>> processing the interrupts.
>>>>>
>>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>>> occur on the T2080RDB.
>>>>>
>>>>> I've had to add the following to expose the environment as a mtd
>>>>> partition
>>>>>
>>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>>                                  compatible = "micron,n25q512ax3",
>>>>> "jedec,spi-nor";
>>>>>                                  reg = <0>;
>>>>>                                  spi-max-frequency = <10000000>; /*
>>>>> input clock */
>>>>> +
>>>>> +                               partition@u-boot {
>>>>> +                                        reg = <0x00000000 0x00100000>;
>>>>> +                                        label = "u-boot";
>>>>> +                                };
>>>>> +                                partition@u-boot-env {
>>>>> +                                        reg = <0x00100000 0x00010000>;
>>>>> +                                        label = "u-boot-env";
>>>>> +                                };
>>>>>                          };
>>>>>                  };
>>>>>
>>>>> And I'm using the following script to poke at the environment
>>>>> (warning if anyone does try this and the bug hits it can render your
>>>>> u-boot environment invalid).
>>>>>
>>>>> cat flash/fw_env_test.sh
>>>>> #!/bin/sh
>>>>>
>>>>> generate_fw_env_config()
>>>>> {
>>>>>    cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>>> name ; do
>>>>>       echo "$dev $size $erasesize $name"
>>>>>       [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>>> $erasesize" >/flash/fw_env.config
>>>>>    done
>>>>> }
>>>>>
>>>>> cycles=10
>>>>> [ $# -ge 1 ] && cycles=$1
>>>>>
>>>>> generate_fw_env_config
>>>>>
>>>>> fw_printenv -c /flash/fw_env.config
>>>>>
>>>>> dmesg -c >/dev/null
>>>>> x=0
>>>>> while [ $x -lt $cycles ]; do
>>>>>      fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>>      fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>>      dmesg -c | grep -q fsl_espi && break;
>>>>>      let x=x+1
>>>>> done
>>>>>
>>>>> echo "Ran $x cycles"
>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>
>>>> With my current workaround of emptying the RX FIFO. It seems
>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>
>>>> fsl_espi ffe110000.spi: tx 70
>>>> fsl_espi ffe110000.spi: rx 03
>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>
>>>>  From all the Micron SPI-NOR datasheets I've got access to it is
>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>> happens some times and not others.
>>> So I think I've got a reproduction and I think I've bisected the problem
>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>> to this result. Given the various rabbit holes I've been down on this
>>> issue already I'd take this information with a good degree of skepticism.
>>>
>> OK, so an easy test should be to re-test with a 5.4 kernel.
>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>> is basically the same as in 5.7 (just two small changes in 5.7).
> There's 6cc0c16d82f88 and maybe also other interrupt related patches
> around this time that could affect book E, so it's good if that exact
> patch is confirmed.

My confirmation is basically that I can induce the issue in a 5.4 kernel 
by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in 
5.9-rc2 by reverting that one commit.

I both cases it's not exactly a clean cherry-pick/revert so I also 
confirmed the bisection result by building at 3282a3da25bd (which sees 
the issue) and the commit just before (which does not).

> I've been staring at 3282a3da25bd for a while and nothing immediately
> stands out. It doesn't look like the low level handlers do anything
> special (well 0x900 does ack the decrementer, but so does the masked
> handler).
>
> Can you try this patch and also enable CONFIG_PPC_IRQ_SOFT_MASK_DEBUG?
>
> Thanks,
> Nick
>
> ---
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index bf21ebd36190..10d339042330 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -214,7 +214,7 @@ void replay_soft_interrupts(void)
>   	struct pt_regs regs;
>   
>   	ppc_save_regs(&regs);
> -	regs.softe = IRQS_ALL_DISABLED;
> +	regs.softe = IRQS_ENABLED;
>   
>   again:
>   	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
> @@ -349,6 +349,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
>   		if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
>   			WARN_ON_ONCE(!(mfmsr() & MSR_EE));
>   		__hard_irq_disable();
> +		local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>   	} else {
>   		/*
>   		 * We should already be hard disabled here. We had bugs
> @@ -368,6 +369,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
>   		}
>   	}
>   
> +	preempt_disable();
>   	irq_soft_mask_set(IRQS_ALL_DISABLED);
>   	trace_hardirqs_off();
>   
> @@ -377,6 +379,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
>   	trace_hardirqs_on();
>   	irq_soft_mask_set(IRQS_ENABLED);
>   	__hard_irq_enable();
> +	preempt_enable();
>   }
>   EXPORT_SYMBOL(arch_local_irq_restore);
>   
I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG 
didn't report anything (either with or without the change above).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-27 22:07                           ` Chris Packham
@ 2020-08-30 12:30                             ` Nicholas Piggin
  2020-08-30 13:55                               ` Heiner Kallweit
  2020-08-30 21:00                               ` Chris Packham
  0 siblings, 2 replies; 32+ messages in thread
From: Nicholas Piggin @ 2020-08-30 12:30 UTC (permalink / raw)
  To: benh, broonie, Chris Packham, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> On 27/08/20 7:12 pm, Nicholas Piggin wrote:
>> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
>>> On 26.08.2020 08:07, Chris Packham wrote:
>>>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>>>
>>>>>> <snip>
>>>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>>>> to my
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>>>> can be
>>>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>>>> register.
>>>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>>>> Consider
>>>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>>>
>>>>>>>>     tx = READ_OP + ADDR
>>>>>>>>     rx = data
>>>>>>>>
>>>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>>>> goes
>>>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>>>> continues until we've received all the data and we finish with
>>>>>>>> ESPI_SPIE
>>>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>>>> isn't set.
>>>>>>>>
>>>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>>>> amount of
>>>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>>>> the fifo.
>>>>>>>>
>>>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>>>> transferred, and read the remaining characters from the FIFO.
>>>>>> The T2080RM that I have says the following about the DON bit
>>>>>>
>>>>>> "Last character was transmitted. The last character was transmitted
>>>>>> and a new command can be written for the next frame."
>>>>>>
>>>>>> That does at least seem to fit with my assertion that it's all about
>>>>>> the TX direction. But the fact that it doesn't happen all the time
>>>>>> throws some doubt on it.
>>>>>>
>>>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>>>> chip can
>>>>>>>> fill the CPUs rx fifo.
>>>>>>>>
>>>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>>>> or even less, then timing shouldn't be an issue.
>>>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>>>> narrative.
>>>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>>>> And yours is the first such report I see. So question is what could
>>>>>>> be so
>>>>>>> special with your setup that it seems you're the only one being
>>>>>>> affected.
>>>>>>> The scenarios you describe are standard, therefore much more people
>>>>>>> should be affected in case of a driver bug.
>>>>>> Agreed. But even on my hardware (which may have a latent issue
>>>>>> despite being in the field for going on 5 years) the issue only
>>>>>> triggers under some fairly specific circumstances.
>>>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>>>> the differences be related to SPI.
>>>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>>>> impact but every time I found something that seemed to be having an
>>>>>> impact I've been able to disprove it. I actually think its about how
>>>>>> busy the system is which may or may not affect when we get round to
>>>>>> processing the interrupts.
>>>>>>
>>>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>>>> occur on the T2080RDB.
>>>>>>
>>>>>> I've had to add the following to expose the environment as a mtd
>>>>>> partition
>>>>>>
>>>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>>>                                  compatible = "micron,n25q512ax3",
>>>>>> "jedec,spi-nor";
>>>>>>                                  reg = <0>;
>>>>>>                                  spi-max-frequency = <10000000>; /*
>>>>>> input clock */
>>>>>> +
>>>>>> +                               partition@u-boot {
>>>>>> +                                        reg = <0x00000000 0x00100000>;
>>>>>> +                                        label = "u-boot";
>>>>>> +                                };
>>>>>> +                                partition@u-boot-env {
>>>>>> +                                        reg = <0x00100000 0x00010000>;
>>>>>> +                                        label = "u-boot-env";
>>>>>> +                                };
>>>>>>                          };
>>>>>>                  };
>>>>>>
>>>>>> And I'm using the following script to poke at the environment
>>>>>> (warning if anyone does try this and the bug hits it can render your
>>>>>> u-boot environment invalid).
>>>>>>
>>>>>> cat flash/fw_env_test.sh
>>>>>> #!/bin/sh
>>>>>>
>>>>>> generate_fw_env_config()
>>>>>> {
>>>>>>    cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>>>> name ; do
>>>>>>       echo "$dev $size $erasesize $name"
>>>>>>       [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>>>> $erasesize" >/flash/fw_env.config
>>>>>>    done
>>>>>> }
>>>>>>
>>>>>> cycles=10
>>>>>> [ $# -ge 1 ] && cycles=$1
>>>>>>
>>>>>> generate_fw_env_config
>>>>>>
>>>>>> fw_printenv -c /flash/fw_env.config
>>>>>>
>>>>>> dmesg -c >/dev/null
>>>>>> x=0
>>>>>> while [ $x -lt $cycles ]; do
>>>>>>      fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>>>      fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>>>      dmesg -c | grep -q fsl_espi && break;
>>>>>>      let x=x+1
>>>>>> done
>>>>>>
>>>>>> echo "Ran $x cycles"
>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>
>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>
>>>>> fsl_espi ffe110000.spi: tx 70
>>>>> fsl_espi ffe110000.spi: rx 03
>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>
>>>>>  From all the Micron SPI-NOR datasheets I've got access to it is
>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>> happens some times and not others.
>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>> to this result. Given the various rabbit holes I've been down on this
>>>> issue already I'd take this information with a good degree of skepticism.
>>>>
>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>> is basically the same as in 5.7 (just two small changes in 5.7).
>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>> around this time that could affect book E, so it's good if that exact
>> patch is confirmed.
> 
> My confirmation is basically that I can induce the issue in a 5.4 kernel 
> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in 
> 5.9-rc2 by reverting that one commit.
> 
> I both cases it's not exactly a clean cherry-pick/revert so I also 
> confirmed the bisection result by building at 3282a3da25bd (which sees 
> the issue) and the commit just before (which does not).

Thanks for testing, that confirms it well.

[snip patch]

> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG 
> didn't report anything (either with or without the change above).

Okay, it was a bit of a shot in the dark. I still can't see what
else has changed.

What would cause this, a lost interrupt? A spurious interrupt? Or
higher interrupt latency?

I don't think the patch should cause significantly worse latency,
(it's supposed to be a bit better if anything because it doesn't set
up the full interrupt frame). But it's possible.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-30 12:30                             ` Nicholas Piggin
@ 2020-08-30 13:55                               ` Heiner Kallweit
  2020-08-30 21:00                               ` Chris Packham
  1 sibling, 0 replies; 32+ messages in thread
From: Heiner Kallweit @ 2020-08-30 13:55 UTC (permalink / raw)
  To: Nicholas Piggin, benh, broonie, Chris Packham, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

On 30.08.2020 14:30, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>> On 27/08/20 7:12 pm, Nicholas Piggin wrote:
>>> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
>>>> On 26.08.2020 08:07, Chris Packham wrote:
>>>>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>>>>
>>>>>>> <snip>
>>>>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>>>>> to my
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>>>>> can be
>>>>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>>>>> register.
>>>>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>>>>> Consider
>>>>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>>>>
>>>>>>>>>     tx = READ_OP + ADDR
>>>>>>>>>     rx = data
>>>>>>>>>
>>>>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>>>>> goes
>>>>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>>>>> continues until we've received all the data and we finish with
>>>>>>>>> ESPI_SPIE
>>>>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>>>>> isn't set.
>>>>>>>>>
>>>>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>>>>> amount of
>>>>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>>>>> the fifo.
>>>>>>>>>
>>>>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>>>>> transferred, and read the remaining characters from the FIFO.
>>>>>>> The T2080RM that I have says the following about the DON bit
>>>>>>>
>>>>>>> "Last character was transmitted. The last character was transmitted
>>>>>>> and a new command can be written for the next frame."
>>>>>>>
>>>>>>> That does at least seem to fit with my assertion that it's all about
>>>>>>> the TX direction. But the fact that it doesn't happen all the time
>>>>>>> throws some doubt on it.
>>>>>>>
>>>>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>>>>> chip can
>>>>>>>>> fill the CPUs rx fifo.
>>>>>>>>>
>>>>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>>>>> or even less, then timing shouldn't be an issue.
>>>>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>>>>> narrative.
>>>>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>>>>> And yours is the first such report I see. So question is what could
>>>>>>>> be so
>>>>>>>> special with your setup that it seems you're the only one being
>>>>>>>> affected.
>>>>>>>> The scenarios you describe are standard, therefore much more people
>>>>>>>> should be affected in case of a driver bug.
>>>>>>> Agreed. But even on my hardware (which may have a latent issue
>>>>>>> despite being in the field for going on 5 years) the issue only
>>>>>>> triggers under some fairly specific circumstances.
>>>>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>>>>> the differences be related to SPI.
>>>>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>>>>> impact but every time I found something that seemed to be having an
>>>>>>> impact I've been able to disprove it. I actually think its about how
>>>>>>> busy the system is which may or may not affect when we get round to
>>>>>>> processing the interrupts.
>>>>>>>
>>>>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>>>>> occur on the T2080RDB.
>>>>>>>
>>>>>>> I've had to add the following to expose the environment as a mtd
>>>>>>> partition
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>>>>                                  compatible = "micron,n25q512ax3",
>>>>>>> "jedec,spi-nor";
>>>>>>>                                  reg = <0>;
>>>>>>>                                  spi-max-frequency = <10000000>; /*
>>>>>>> input clock */
>>>>>>> +
>>>>>>> +                               partition@u-boot {
>>>>>>> +                                        reg = <0x00000000 0x00100000>;
>>>>>>> +                                        label = "u-boot";
>>>>>>> +                                };
>>>>>>> +                                partition@u-boot-env {
>>>>>>> +                                        reg = <0x00100000 0x00010000>;
>>>>>>> +                                        label = "u-boot-env";
>>>>>>> +                                };
>>>>>>>                          };
>>>>>>>                  };
>>>>>>>
>>>>>>> And I'm using the following script to poke at the environment
>>>>>>> (warning if anyone does try this and the bug hits it can render your
>>>>>>> u-boot environment invalid).
>>>>>>>
>>>>>>> cat flash/fw_env_test.sh
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> generate_fw_env_config()
>>>>>>> {
>>>>>>>    cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>>>>> name ; do
>>>>>>>       echo "$dev $size $erasesize $name"
>>>>>>>       [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>>>>> $erasesize" >/flash/fw_env.config
>>>>>>>    done
>>>>>>> }
>>>>>>>
>>>>>>> cycles=10
>>>>>>> [ $# -ge 1 ] && cycles=$1
>>>>>>>
>>>>>>> generate_fw_env_config
>>>>>>>
>>>>>>> fw_printenv -c /flash/fw_env.config
>>>>>>>
>>>>>>> dmesg -c >/dev/null
>>>>>>> x=0
>>>>>>> while [ $x -lt $cycles ]; do
>>>>>>>      fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>>>>      fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>>>>      dmesg -c | grep -q fsl_espi && break;
>>>>>>>      let x=x+1
>>>>>>> done
>>>>>>>
>>>>>>> echo "Ran $x cycles"
>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>
>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>
>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>
>>>>>>  From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>> happens some times and not others.
>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>
>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>> around this time that could affect book E, so it's good if that exact
>>> patch is confirmed.
>>
>> My confirmation is basically that I can induce the issue in a 5.4 kernel 
>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in 
>> 5.9-rc2 by reverting that one commit.
>>
>> I both cases it's not exactly a clean cherry-pick/revert so I also 
>> confirmed the bisection result by building at 3282a3da25bd (which sees 
>> the issue) and the commit just before (which does not).
> 
> Thanks for testing, that confirms it well.
> 
> [snip patch]
> 
>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG 
>> didn't report anything (either with or without the change above).
> 
> Okay, it was a bit of a shot in the dark. I still can't see what
> else has changed.
> 
> What would cause this, a lost interrupt? A spurious interrupt? Or
> higher interrupt latency?
> 
> I don't think the patch should cause significantly worse latency,
> (it's supposed to be a bit better if anything because it doesn't set
> up the full interrupt frame). But it's possible.
> 
> Thanks,
> Nick
> 

As additional background:
When I refactored and extended the fsl-espi driver my test device was
a P1014-based WiFi router (running at 800MHz CPU clock). I still use it
and operate the SPI NOR (s25fl128s) at 50MHz in dual read mode
(with a 4.9 kernel). I never had a problem. 
If the issue now occurs on a system with most likely faster CPU and at
a much lower SPI bus frequency, then something must have a significant
impact on interrupt performance. However I can't say exactly which kind
of interrupt issue should cause the SPI issue we see.

Heiner

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-30 12:30                             ` Nicholas Piggin
  2020-08-30 13:55                               ` Heiner Kallweit
@ 2020-08-30 21:00                               ` Chris Packham
  2020-08-30 21:41                                 ` Heiner Kallweit
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-30 21:00 UTC (permalink / raw)
  To: Nicholas Piggin, benh, broonie, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi


On 31/08/20 12:30 am, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:

<snip>

>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>
>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>
>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>
>>>>>>   From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>> happens some times and not others.
>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>
>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>> around this time that could affect book E, so it's good if that exact
>>> patch is confirmed.
>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>> 5.9-rc2 by reverting that one commit.
>>
>> I both cases it's not exactly a clean cherry-pick/revert so I also
>> confirmed the bisection result by building at 3282a3da25bd (which sees
>> the issue) and the commit just before (which does not).
> Thanks for testing, that confirms it well.
>
> [snip patch]
>
>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>> didn't report anything (either with or without the change above).
> Okay, it was a bit of a shot in the dark. I still can't see what
> else has changed.
>
> What would cause this, a lost interrupt? A spurious interrupt? Or
> higher interrupt latency?
>
> I don't think the patch should cause significantly worse latency,
> (it's supposed to be a bit better if anything because it doesn't set
> up the full interrupt frame). But it's possible.

My working theory is that the SPI_DON indication is all about the TX 
direction an now that the interrupts are faster we're hitting an error 
because there is still RX activity going on. Heiner disagrees with my 
interpretation of the SPI_DON indication and the fact that it doesn't 
happen every time does throw doubt on it.

I can't really explain the extra RX byte in the fifo. We know how many 
bytes to expect and we pull that many from the fifo so it's not as if 
we're missing an interrupt causing us to skip the last byte. I've been 
looking for some kind of off-by-one calculation but again if it were 
something like that it'd happen all the time.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-30 21:00                               ` Chris Packham
@ 2020-08-30 21:41                                 ` Heiner Kallweit
  2020-08-30 21:59                                   ` Chris Packham
  0 siblings, 1 reply; 32+ messages in thread
From: Heiner Kallweit @ 2020-08-30 21:41 UTC (permalink / raw)
  To: Chris Packham, Nicholas Piggin, benh, broonie, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

On 30.08.2020 23:00, Chris Packham wrote:
> 
> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> 
> <snip>
> 
>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>
>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>
>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>
>>>>>>>   From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>> happens some times and not others.
>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>
>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>> around this time that could affect book E, so it's good if that exact
>>>> patch is confirmed.
>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>> 5.9-rc2 by reverting that one commit.
>>>
>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>> the issue) and the commit just before (which does not).
>> Thanks for testing, that confirms it well.
>>
>> [snip patch]
>>
>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>> didn't report anything (either with or without the change above).
>> Okay, it was a bit of a shot in the dark. I still can't see what
>> else has changed.
>>
>> What would cause this, a lost interrupt? A spurious interrupt? Or
>> higher interrupt latency?
>>
>> I don't think the patch should cause significantly worse latency,
>> (it's supposed to be a bit better if anything because it doesn't set
>> up the full interrupt frame). But it's possible.
> 
> My working theory is that the SPI_DON indication is all about the TX 
> direction an now that the interrupts are faster we're hitting an error 
> because there is still RX activity going on. Heiner disagrees with my 
> interpretation of the SPI_DON indication and the fact that it doesn't 
> happen every time does throw doubt on it.
> 
It's right that the eSPI spec can be interpreted that SPI_DON refers to
TX only. However this wouldn't really make sense, because also for RX
we program the frame length, and therefore want to be notified once the
full frame was received. Also practical experience shows that SPI_DON
is set also after RX-only transfers.
Typical SPI NOR use case is that you write read command + start address,
followed by a longer read. If the TX-only interpretation would be right,
we'd always end up with SPI_DON not being set.

> I can't really explain the extra RX byte in the fifo. We know how many 
> bytes to expect and we pull that many from the fifo so it's not as if 
> we're missing an interrupt causing us to skip the last byte. I've been 
> looking for some kind of off-by-one calculation but again if it were 
> something like that it'd happen all the time.
> 
Maybe it helps to know what value this extra byte in the FIFO has. Is it:
- a duplicate of the last read byte
- or the next byte (at <end address> + 1)
- or a fixed value, e.g. always 0x00 or 0xff



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-30 21:41                                 ` Heiner Kallweit
@ 2020-08-30 21:59                                   ` Chris Packham
  2020-08-31 12:33                                     ` Heiner Kallweit
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-08-30 21:59 UTC (permalink / raw)
  To: Heiner Kallweit, Nicholas Piggin, benh, broonie, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi


On 31/08/20 9:41 am, Heiner Kallweit wrote:
> On 30.08.2020 23:00, Chris Packham wrote:
>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>> <snip>
>>
>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>
>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>
>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>
>>>>>>>>    From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>> happens some times and not others.
>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>
>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>> around this time that could affect book E, so it's good if that exact
>>>>> patch is confirmed.
>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>> 5.9-rc2 by reverting that one commit.
>>>>
>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>> the issue) and the commit just before (which does not).
>>> Thanks for testing, that confirms it well.
>>>
>>> [snip patch]
>>>
>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>> didn't report anything (either with or without the change above).
>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>> else has changed.
>>>
>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>> higher interrupt latency?
>>>
>>> I don't think the patch should cause significantly worse latency,
>>> (it's supposed to be a bit better if anything because it doesn't set
>>> up the full interrupt frame). But it's possible.
>> My working theory is that the SPI_DON indication is all about the TX
>> direction an now that the interrupts are faster we're hitting an error
>> because there is still RX activity going on. Heiner disagrees with my
>> interpretation of the SPI_DON indication and the fact that it doesn't
>> happen every time does throw doubt on it.
>>
> It's right that the eSPI spec can be interpreted that SPI_DON refers to
> TX only. However this wouldn't really make sense, because also for RX
> we program the frame length, and therefore want to be notified once the
> full frame was received. Also practical experience shows that SPI_DON
> is set also after RX-only transfers.
> Typical SPI NOR use case is that you write read command + start address,
> followed by a longer read. If the TX-only interpretation would be right,
> we'd always end up with SPI_DON not being set.
>
>> I can't really explain the extra RX byte in the fifo. We know how many
>> bytes to expect and we pull that many from the fifo so it's not as if
>> we're missing an interrupt causing us to skip the last byte. I've been
>> looking for some kind of off-by-one calculation but again if it were
>> something like that it'd happen all the time.
>>
> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
> - a duplicate of the last read byte
> - or the next byte (at <end address> + 1)
> - or a fixed value, e.g. always 0x00 or 0xff

The values were up thread a bit but I'll repeat them here

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03


The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
either a READ_SR or a READ_FSR. Never a data read.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-30 21:59                                   ` Chris Packham
@ 2020-08-31 12:33                                     ` Heiner Kallweit
  2020-09-01  1:25                                       ` Chris Packham
  2020-09-01  2:52                                       ` Chris Packham
  0 siblings, 2 replies; 32+ messages in thread
From: Heiner Kallweit @ 2020-08-31 12:33 UTC (permalink / raw)
  To: Chris Packham, Nicholas Piggin, benh, broonie, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

On 30.08.2020 23:59, Chris Packham wrote:
> 
> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:00, Chris Packham wrote:
>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>> <snip>
>>>
>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>
>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>
>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>
>>>>>>>>>    From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>> happens some times and not others.
>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>
>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>> patch is confirmed.
>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>> 5.9-rc2 by reverting that one commit.
>>>>>
>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>> the issue) and the commit just before (which does not).
>>>> Thanks for testing, that confirms it well.
>>>>
>>>> [snip patch]
>>>>
>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>> didn't report anything (either with or without the change above).
>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>> else has changed.
>>>>
>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>> higher interrupt latency?
>>>>
>>>> I don't think the patch should cause significantly worse latency,
>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>> up the full interrupt frame). But it's possible.
>>> My working theory is that the SPI_DON indication is all about the TX
>>> direction an now that the interrupts are faster we're hitting an error
>>> because there is still RX activity going on. Heiner disagrees with my
>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>> happen every time does throw doubt on it.
>>>
>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>> TX only. However this wouldn't really make sense, because also for RX
>> we program the frame length, and therefore want to be notified once the
>> full frame was received. Also practical experience shows that SPI_DON
>> is set also after RX-only transfers.
>> Typical SPI NOR use case is that you write read command + start address,
>> followed by a longer read. If the TX-only interpretation would be right,
>> we'd always end up with SPI_DON not being set.
>>
>>> I can't really explain the extra RX byte in the fifo. We know how many
>>> bytes to expect and we pull that many from the fifo so it's not as if
>>> we're missing an interrupt causing us to skip the last byte. I've been
>>> looking for some kind of off-by-one calculation but again if it were
>>> something like that it'd happen all the time.
>>>
>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>> - a duplicate of the last read byte
>> - or the next byte (at <end address> + 1)
>> - or a fixed value, e.g. always 0x00 or 0xff
> 
> The values were up thread a bit but I'll repeat them here
> 
> fsl_espi ffe110000.spi: tx 70
> fsl_espi ffe110000.spi: rx 03
> fsl_espi ffe110000.spi: Extra RX 00
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> 
> 
> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
> either a READ_SR or a READ_FSR. Never a data read.
> 
Just remembered something about SPIE_DON:
Transfers are always full duplex, therefore in case of a read the chip
sends dummy zero's. Having said that in case of a read SPIE_DON means
that the last dummy zero was shifted out.

READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
So the issue may have a dependency on the length of the transfer.
However I see no good explanation so far. You can try adding a delay of
a few miroseconds between the following to commands in fsl_espi_bufs().

	fsl_espi_write_reg(espi, ESPI_SPIM, mask);

	/* Prevent filling the fifo from getting interrupted */
	spin_lock_irq(&espi->lock);

Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-31 12:33                                     ` Heiner Kallweit
@ 2020-09-01  1:25                                       ` Chris Packham
  2020-09-01  6:14                                         ` Nicholas Piggin
  2020-09-01  2:52                                       ` Chris Packham
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-09-01  1:25 UTC (permalink / raw)
  To: Heiner Kallweit, Nicholas Piggin, benh, broonie, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi


On 1/09/20 12:33 am, Heiner Kallweit wrote:
> On 30.08.2020 23:59, Chris Packham wrote:
>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>> <snip>
>>>>
>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>
>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>
>>>>>>>>>>     From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>> happens some times and not others.
>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>
>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>> patch is confirmed.
>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>
>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>> the issue) and the commit just before (which does not).
>>>>> Thanks for testing, that confirms it well.
>>>>>
>>>>> [snip patch]
>>>>>
>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>> didn't report anything (either with or without the change above).
>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>> else has changed.
>>>>>
>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>> higher interrupt latency?
>>>>>
>>>>> I don't think the patch should cause significantly worse latency,
>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>> up the full interrupt frame). But it's possible.
>>>> My working theory is that the SPI_DON indication is all about the TX
>>>> direction an now that the interrupts are faster we're hitting an error
>>>> because there is still RX activity going on. Heiner disagrees with my
>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>> happen every time does throw doubt on it.
>>>>
>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>> TX only. However this wouldn't really make sense, because also for RX
>>> we program the frame length, and therefore want to be notified once the
>>> full frame was received. Also practical experience shows that SPI_DON
>>> is set also after RX-only transfers.
>>> Typical SPI NOR use case is that you write read command + start address,
>>> followed by a longer read. If the TX-only interpretation would be right,
>>> we'd always end up with SPI_DON not being set.
>>>
>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>> looking for some kind of off-by-one calculation but again if it were
>>>> something like that it'd happen all the time.
>>>>
>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>> - a duplicate of the last read byte
>>> - or the next byte (at <end address> + 1)
>>> - or a fixed value, e.g. always 0x00 or 0xff
>> The values were up thread a bit but I'll repeat them here
>>
>> fsl_espi ffe110000.spi: tx 70
>> fsl_espi ffe110000.spi: rx 03
>> fsl_espi ffe110000.spi: Extra RX 00
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>>
>>
>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>> either a READ_SR or a READ_FSR. Never a data read.
>>
> Just remembered something about SPIE_DON:
> Transfers are always full duplex, therefore in case of a read the chip
> sends dummy zero's. Having said that in case of a read SPIE_DON means
> that the last dummy zero was shifted out.
>
> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> So the issue may have a dependency on the length of the transfer.
> However I see no good explanation so far. You can try adding a delay of
> a few miroseconds between the following to commands in fsl_espi_bufs().
>
> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>
> 	/* Prevent filling the fifo from getting interrupted */
> 	spin_lock_irq(&espi->lock);
>
> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
I think this might be heading in the right direction. Playing about with 
a delay does seem to make the two symptoms less likely. Although I have 
to set it quite high (i.e. msleep(100)) to completely avoid any 
possibility of seeing either message.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-08-31 12:33                                     ` Heiner Kallweit
  2020-09-01  1:25                                       ` Chris Packham
@ 2020-09-01  2:52                                       ` Chris Packham
  2020-09-01  7:38                                         ` Heiner Kallweit
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-09-01  2:52 UTC (permalink / raw)
  To: Heiner Kallweit, Nicholas Piggin, benh, broonie, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi


On 1/09/20 12:33 am, Heiner Kallweit wrote:
> On 30.08.2020 23:59, Chris Packham wrote:
>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>> <snip>
>>>>
>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>
>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>
>>>>>>>>>>     From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>> happens some times and not others.
>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>
>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>> patch is confirmed.
>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>
>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>> the issue) and the commit just before (which does not).
>>>>> Thanks for testing, that confirms it well.
>>>>>
>>>>> [snip patch]
>>>>>
>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>> didn't report anything (either with or without the change above).
>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>> else has changed.
>>>>>
>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>> higher interrupt latency?
>>>>>
>>>>> I don't think the patch should cause significantly worse latency,
>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>> up the full interrupt frame). But it's possible.
>>>> My working theory is that the SPI_DON indication is all about the TX
>>>> direction an now that the interrupts are faster we're hitting an error
>>>> because there is still RX activity going on. Heiner disagrees with my
>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>> happen every time does throw doubt on it.
>>>>
>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>> TX only. However this wouldn't really make sense, because also for RX
>>> we program the frame length, and therefore want to be notified once the
>>> full frame was received. Also practical experience shows that SPI_DON
>>> is set also after RX-only transfers.
>>> Typical SPI NOR use case is that you write read command + start address,
>>> followed by a longer read. If the TX-only interpretation would be right,
>>> we'd always end up with SPI_DON not being set.
>>>
>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>> looking for some kind of off-by-one calculation but again if it were
>>>> something like that it'd happen all the time.
>>>>
>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>> - a duplicate of the last read byte
>>> - or the next byte (at <end address> + 1)
>>> - or a fixed value, e.g. always 0x00 or 0xff
>> The values were up thread a bit but I'll repeat them here
>>
>> fsl_espi ffe110000.spi: tx 70
>> fsl_espi ffe110000.spi: rx 03
>> fsl_espi ffe110000.spi: Extra RX 00
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>>
>>
>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>> either a READ_SR or a READ_FSR. Never a data read.
>>
> Just remembered something about SPIE_DON:
> Transfers are always full duplex,

That's not true in rxskip mode. Setting rxskip forces the eSPI into 
half-duplex mode. So far all the instances of "extra" rx bytes have been 
when rxskip == 1.

I think that could be where our experience with SPIE_DON differ. In 
full-duplex mode yes DON is always set. In half duplex mode we can end 
up with DON set or not set depending on the interrupt timing.

> therefore in case of a read the chip
> sends dummy zero's. Having said that in case of a read SPIE_DON means
> that the last dummy zero was shifted out.
>
> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> So the issue may have a dependency on the length of the transfer.
> However I see no good explanation so far. You can try adding a delay of
> a few miroseconds between the following to commands in fsl_espi_bufs().
>
> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>
> 	/* Prevent filling the fifo from getting interrupted */
> 	spin_lock_irq(&espi->lock);
>
> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-09-01  1:25                                       ` Chris Packham
@ 2020-09-01  6:14                                         ` Nicholas Piggin
  2020-09-01 23:29                                           ` Chris Packham
  2020-09-03 23:58                                           ` Chris Packham
  0 siblings, 2 replies; 32+ messages in thread
From: Nicholas Piggin @ 2020-09-01  6:14 UTC (permalink / raw)
  To: benh, broonie, Chris Packham, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> 
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>> <snip>
>>>>>
>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>
>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>
>>>>>>>>>>>     From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>> happens some times and not others.
>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>
>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>> patch is confirmed.
>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>
>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>> the issue) and the commit just before (which does not).
>>>>>> Thanks for testing, that confirms it well.
>>>>>>
>>>>>> [snip patch]
>>>>>>
>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>> didn't report anything (either with or without the change above).
>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>> else has changed.
>>>>>>
>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>> higher interrupt latency?
>>>>>>
>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>> up the full interrupt frame). But it's possible.
>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>> happen every time does throw doubt on it.
>>>>>
>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>> TX only. However this wouldn't really make sense, because also for RX
>>>> we program the frame length, and therefore want to be notified once the
>>>> full frame was received. Also practical experience shows that SPI_DON
>>>> is set also after RX-only transfers.
>>>> Typical SPI NOR use case is that you write read command + start address,
>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>> we'd always end up with SPI_DON not being set.
>>>>
>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>> something like that it'd happen all the time.
>>>>>
>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>> - a duplicate of the last read byte
>>>> - or the next byte (at <end address> + 1)
>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>> The values were up thread a bit but I'll repeat them here
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>>
>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>> either a READ_SR or a READ_FSR. Never a data read.
>>>
>> Just remembered something about SPIE_DON:
>> Transfers are always full duplex, therefore in case of a read the chip
>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>> that the last dummy zero was shifted out.
>>
>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>> So the issue may have a dependency on the length of the transfer.
>> However I see no good explanation so far. You can try adding a delay of
>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>
>> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>
>> 	/* Prevent filling the fifo from getting interrupted */
>> 	spin_lock_irq(&espi->lock);
>>
>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
> I think this might be heading in the right direction. Playing about with 
> a delay does seem to make the two symptoms less likely. Although I have 
> to set it quite high (i.e. msleep(100)) to completely avoid any 
> possibility of seeing either message.

The patch might replay the interrupt a little bit faster, but it would 
be a few microseconds at most I think (just from improved code).

Would you be able to ftrace the interrupt handler function and see if you
can see a difference in number or timing of interrupts? I'm at a bit of 
a loss.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-09-01  2:52                                       ` Chris Packham
@ 2020-09-01  7:38                                         ` Heiner Kallweit
  0 siblings, 0 replies; 32+ messages in thread
From: Heiner Kallweit @ 2020-09-01  7:38 UTC (permalink / raw)
  To: Chris Packham, Nicholas Piggin, benh, broonie, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi

On 01.09.2020 04:52, Chris Packham wrote:
> 
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>> <snip>
>>>>>
>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>
>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>
>>>>>>>>>>>     From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>> happens some times and not others.
>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>
>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>> patch is confirmed.
>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>
>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>> the issue) and the commit just before (which does not).
>>>>>> Thanks for testing, that confirms it well.
>>>>>>
>>>>>> [snip patch]
>>>>>>
>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>> didn't report anything (either with or without the change above).
>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>> else has changed.
>>>>>>
>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>> higher interrupt latency?
>>>>>>
>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>> up the full interrupt frame). But it's possible.
>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>> happen every time does throw doubt on it.
>>>>>
>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>> TX only. However this wouldn't really make sense, because also for RX
>>>> we program the frame length, and therefore want to be notified once the
>>>> full frame was received. Also practical experience shows that SPI_DON
>>>> is set also after RX-only transfers.
>>>> Typical SPI NOR use case is that you write read command + start address,
>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>> we'd always end up with SPI_DON not being set.
>>>>
>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>> something like that it'd happen all the time.
>>>>>
>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>> - a duplicate of the last read byte
>>>> - or the next byte (at <end address> + 1)
>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>> The values were up thread a bit but I'll repeat them here
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>>
>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>> either a READ_SR or a READ_FSR. Never a data read.
>>>
>> Just remembered something about SPIE_DON:
>> Transfers are always full duplex,
> 
> That's not true in rxskip mode. Setting rxskip forces the eSPI into 
> half-duplex mode. So far all the instances of "extra" rx bytes have been 
> when rxskip == 1.
> 
> I think that could be where our experience with SPIE_DON differ. In 
> full-duplex mode yes DON is always set. In half duplex mode we can end 
> up with DON set or not set depending on the interrupt timing.
> 
Of course I'm not the chip designer, but I think under the hood rxskip
mode doesn't differ from normal full duplex. It just ignores the first
read bytes and sends dummy zero's once the read part starts.

And in normal SPI NOR use cases most transfers are rxskip:
send read command + read data

This can be simply tested / verified: send read command + read 31 bytes
That's less than the FIFO size, therefore only active interrupt source
is SPIE_DON. If your assumption is true the transfer would time out.

>> therefore in case of a read the chip
>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>> that the last dummy zero was shifted out.
>>
>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>> So the issue may have a dependency on the length of the transfer.
>> However I see no good explanation so far. You can try adding a delay of
>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>
>> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>
>> 	/* Prevent filling the fifo from getting interrupted */
>> 	spin_lock_irq(&espi->lock);
>>
>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-09-01  6:14                                         ` Nicholas Piggin
@ 2020-09-01 23:29                                           ` Chris Packham
  2020-09-01 23:31                                             ` Chris Packham
  2020-09-03 23:58                                           ` Chris Packham
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Packham @ 2020-09-01 23:29 UTC (permalink / raw)
  To: Nicholas Piggin, benh, broonie, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi


On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:59, Chris Packham wrote:
>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>>> <snip>
>>>>>>
>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>
>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>
>>>>>>>>>>>>      From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>>> happens some times and not others.
>>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>>
>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>>> patch is confirmed.
>>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>>
>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>>> the issue) and the commit just before (which does not).
>>>>>>> Thanks for testing, that confirms it well.
>>>>>>>
>>>>>>> [snip patch]
>>>>>>>
>>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>>> didn't report anything (either with or without the change above).
>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>>> else has changed.
>>>>>>>
>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>>> higher interrupt latency?
>>>>>>>
>>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>>> up the full interrupt frame). But it's possible.
>>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>>> happen every time does throw doubt on it.
>>>>>>
>>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>>> TX only. However this wouldn't really make sense, because also for RX
>>>>> we program the frame length, and therefore want to be notified once the
>>>>> full frame was received. Also practical experience shows that SPI_DON
>>>>> is set also after RX-only transfers.
>>>>> Typical SPI NOR use case is that you write read command + start address,
>>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>>> we'd always end up with SPI_DON not being set.
>>>>>
>>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>>> something like that it'd happen all the time.
>>>>>>
>>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>>> - a duplicate of the last read byte
>>>>> - or the next byte (at <end address> + 1)
>>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>>> The values were up thread a bit but I'll repeat them here
>>>>
>>>> fsl_espi ffe110000.spi: tx 70
>>>> fsl_espi ffe110000.spi: rx 03
>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>
>>>>
>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>>> either a READ_SR or a READ_FSR. Never a data read.
>>>>
>>> Just remembered something about SPIE_DON:
>>> Transfers are always full duplex, therefore in case of a read the chip
>>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>>> that the last dummy zero was shifted out.
>>>
>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>>> So the issue may have a dependency on the length of the transfer.
>>> However I see no good explanation so far. You can try adding a delay of
>>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>>
>>> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>>
>>> 	/* Prevent filling the fifo from getting interrupted */
>>> 	spin_lock_irq(&espi->lock);
>>>
>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
>> I think this might be heading in the right direction. Playing about with
>> a delay does seem to make the two symptoms less likely. Although I have
>> to set it quite high (i.e. msleep(100)) to completely avoid any
>> possibility of seeing either message.
> The patch might replay the interrupt a little bit faster, but it would
> be a few microseconds at most I think (just from improved code).
>
> Would you be able to ftrace the interrupt handler function and see if you
> can see a difference in number or timing of interrupts? I'm at a bit of
> a loss.
This is getting really weird. I was setting up to run with ftrace and 
found I couldn't reproduce it on the tip of Linus's tree (currently 
pointing at e7a522c83b86). But I swear I could last week. Sure enough if 
I checkout 5.9-rc2 (or 5.7.15) I can reproduce the problem again.
> Thanks,
> Nick
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-09-01 23:29                                           ` Chris Packham
@ 2020-09-01 23:31                                             ` Chris Packham
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Packham @ 2020-09-01 23:31 UTC (permalink / raw)
  To: Nicholas Piggin, benh, broonie, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi


On 2/09/20 11:29 am, Chris Packham wrote:
>
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
>> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:59, Chris Packham wrote:
>>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>>>> <snip>
>>>>>>>
>>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the 
>>>>>>>>>>>>> T2080RDB
>>>>>>>>>>>>>
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's 
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>>
>>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra 
>>>>>>>>>>>>> byte in the
>>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a 
>>>>>>>>>>>>> READ_FSR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's 
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's 
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>>
>>>>>>>>>>>>>      From all the Micron SPI-NOR datasheets I've got 
>>>>>>>>>>>>> access to it is
>>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea 
>>>>>>>>>>>>> why it
>>>>>>>>>>>>> happens some times and not others.
>>>>>>>>>>>> So I think I've got a reproduction and I think I've 
>>>>>>>>>>>> bisected the problem
>>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft 
>>>>>>>>>>>> interrupt replay in
>>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too 
>>>>>>>>>>>> much scrutiny
>>>>>>>>>>>> to this result. Given the various rabbit holes I've been 
>>>>>>>>>>>> down on this
>>>>>>>>>>>> issue already I'd take this information with a good degree 
>>>>>>>>>>>> of skepticism.
>>>>>>>>>>>>
>>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>>>> It doesn't have yet the change you're referring to, and the 
>>>>>>>>>>> fsl-espi driver
>>>>>>>>>>> is basically the same as in 5.7 (just two small changes in 
>>>>>>>>>>> 5.7).
>>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related 
>>>>>>>>>> patches
>>>>>>>>>> around this time that could affect book E, so it's good if 
>>>>>>>>>> that exact
>>>>>>>>>> patch is confirmed.
>>>>>>>>> My confirmation is basically that I can induce the issue in a 
>>>>>>>>> 5.4 kernel
>>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the 
>>>>>>>>> issue in
>>>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>>>
>>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I 
>>>>>>>>> also
>>>>>>>>> confirmed the bisection result by building at 3282a3da25bd 
>>>>>>>>> (which sees
>>>>>>>>> the issue) and the commit just before (which does not).
>>>>>>>> Thanks for testing, that confirms it well.
>>>>>>>>
>>>>>>>> [snip patch]
>>>>>>>>
>>>>>>>>> I still saw the issue with this change applied. 
>>>>>>>>> PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>>>> didn't report anything (either with or without the change above).
>>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>>>> else has changed.
>>>>>>>>
>>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>>>> higher interrupt latency?
>>>>>>>>
>>>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>>>> (it's supposed to be a bit better if anything because it 
>>>>>>>> doesn't set
>>>>>>>> up the full interrupt frame). But it's possible.
>>>>>>> My working theory is that the SPI_DON indication is all about 
>>>>>>> the TX
>>>>>>> direction an now that the interrupts are faster we're hitting an 
>>>>>>> error
>>>>>>> because there is still RX activity going on. Heiner disagrees 
>>>>>>> with my
>>>>>>> interpretation of the SPI_DON indication and the fact that it 
>>>>>>> doesn't
>>>>>>> happen every time does throw doubt on it.
>>>>>>>
>>>>>> It's right that the eSPI spec can be interpreted that SPI_DON 
>>>>>> refers to
>>>>>> TX only. However this wouldn't really make sense, because also 
>>>>>> for RX
>>>>>> we program the frame length, and therefore want to be notified 
>>>>>> once the
>>>>>> full frame was received. Also practical experience shows that 
>>>>>> SPI_DON
>>>>>> is set also after RX-only transfers.
>>>>>> Typical SPI NOR use case is that you write read command + start 
>>>>>> address,
>>>>>> followed by a longer read. If the TX-only interpretation would be 
>>>>>> right,
>>>>>> we'd always end up with SPI_DON not being set.
>>>>>>
>>>>>>> I can't really explain the extra RX byte in the fifo. We know 
>>>>>>> how many
>>>>>>> bytes to expect and we pull that many from the fifo so it's not 
>>>>>>> as if
>>>>>>> we're missing an interrupt causing us to skip the last byte. 
>>>>>>> I've been
>>>>>>> looking for some kind of off-by-one calculation but again if it 
>>>>>>> were
>>>>>>> something like that it'd happen all the time.
>>>>>>>
>>>>>> Maybe it helps to know what value this extra byte in the FIFO 
>>>>>> has. Is it:
>>>>>> - a duplicate of the last read byte
>>>>>> - or the next byte (at <end address> + 1)
>>>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>>>> The values were up thread a bit but I'll repeat them here
>>>>>
>>>>> fsl_espi ffe110000.spi: tx 70
>>>>> fsl_espi ffe110000.spi: rx 03
>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>
>>>>>
>>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen 
>>>>> them with
>>>>> either a READ_SR or a READ_FSR. Never a data read.
>>>>>
>>>> Just remembered something about SPIE_DON:
>>>> Transfers are always full duplex, therefore in case of a read the chip
>>>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>>>> that the last dummy zero was shifted out.
>>>>
>>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 
>>>> byte in.
>>>> So the issue may have a dependency on the length of the transfer.
>>>> However I see no good explanation so far. You can try adding a 
>>>> delay of
>>>> a few miroseconds between the following to commands in 
>>>> fsl_espi_bufs().
>>>>
>>>>     fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>>>
>>>>     /* Prevent filling the fifo from getting interrupted */
>>>>     spin_lock_irq(&espi->lock);
>>>>
>>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too 
>>>> close.
>>> I think this might be heading in the right direction. Playing about 
>>> with
>>> a delay does seem to make the two symptoms less likely. Although I have
>>> to set it quite high (i.e. msleep(100)) to completely avoid any
>>> possibility of seeing either message.
>> The patch might replay the interrupt a little bit faster, but it would
>> be a few microseconds at most I think (just from improved code).
>>
>> Would you be able to ftrace the interrupt handler function and see if 
>> you
>> can see a difference in number or timing of interrupts? I'm at a bit of
>> a loss.
> This is getting really weird. I was setting up to run with ftrace and 
> found I couldn't reproduce it on the tip of Linus's tree (currently 
> pointing at e7a522c83b86). But I swear I could last week. Sure enough 
> if I checkout 5.9-rc2 (or 5.7.15) I can reproduce the problem again.

*Sigh* my master branch still has the interrupt changes reverted.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-09-01  6:14                                         ` Nicholas Piggin
  2020-09-01 23:29                                           ` Chris Packham
@ 2020-09-03 23:58                                           ` Chris Packham
       [not found]                                             ` <CAFSsGVvRMQoEoBN1hpao_4uM1yF2wwuKPbMkAcwFWyE1X8HDbQ@mail.gmail.com>
  2020-09-07  9:53                                             ` Joakim Tjernlund
  1 sibling, 2 replies; 32+ messages in thread
From: Chris Packham @ 2020-09-03 23:58 UTC (permalink / raw)
  To: Nicholas Piggin, benh, broonie, Heiner Kallweit, mpe, paulus
  Cc: linux-kernel, linuxppc-dev, linux-spi


On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:59, Chris Packham wrote:
>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>>> <snip>
>>>>>>
>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>
>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>
>>>>>>>>>>>>      From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>>> happens some times and not others.
>>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>>
>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>>> patch is confirmed.
>>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>>
>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>>> the issue) and the commit just before (which does not).
>>>>>>> Thanks for testing, that confirms it well.
>>>>>>>
>>>>>>> [snip patch]
>>>>>>>
>>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>>> didn't report anything (either with or without the change above).
>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>>> else has changed.
>>>>>>>
>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>>> higher interrupt latency?
>>>>>>>
>>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>>> up the full interrupt frame). But it's possible.
>>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>>> happen every time does throw doubt on it.
>>>>>>
>>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>>> TX only. However this wouldn't really make sense, because also for RX
>>>>> we program the frame length, and therefore want to be notified once the
>>>>> full frame was received. Also practical experience shows that SPI_DON
>>>>> is set also after RX-only transfers.
>>>>> Typical SPI NOR use case is that you write read command + start address,
>>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>>> we'd always end up with SPI_DON not being set.
>>>>>
>>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>>> something like that it'd happen all the time.
>>>>>>
>>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>>> - a duplicate of the last read byte
>>>>> - or the next byte (at <end address> + 1)
>>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>>> The values were up thread a bit but I'll repeat them here
>>>>
>>>> fsl_espi ffe110000.spi: tx 70
>>>> fsl_espi ffe110000.spi: rx 03
>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>
>>>>
>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>>> either a READ_SR or a READ_FSR. Never a data read.
>>>>
>>> Just remembered something about SPIE_DON:
>>> Transfers are always full duplex, therefore in case of a read the chip
>>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>>> that the last dummy zero was shifted out.
>>>
>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>>> So the issue may have a dependency on the length of the transfer.
>>> However I see no good explanation so far. You can try adding a delay of
>>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>>
>>> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>>
>>> 	/* Prevent filling the fifo from getting interrupted */
>>> 	spin_lock_irq(&espi->lock);
>>>
>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
>> I think this might be heading in the right direction. Playing about with
>> a delay does seem to make the two symptoms less likely. Although I have
>> to set it quite high (i.e. msleep(100)) to completely avoid any
>> possibility of seeing either message.
> The patch might replay the interrupt a little bit faster, but it would
> be a few microseconds at most I think (just from improved code).
>
> Would you be able to ftrace the interrupt handler function and see if you
> can see a difference in number or timing of interrupts? I'm at a bit of
> a loss.

I tried ftrace but I really wasn't sure what I was looking for. 
Capturing a "bad" case was pretty tricky. But I think I've identified a 
fix (I'll send it as a proper patch shortly). The gist is

diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
index 7e7c92cafdbb..cb120b68c0e2 100644
--- a/drivers/spi/spi-fsl-espi.c
+++ b/drivers/spi/spi-fsl-espi.c
@@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi 
*espi, u32 events)
  static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
  {
         struct fsl_espi *espi = context_data;
-       u32 events;
+       u32 events, mask;

         spin_lock(&espi->lock);

         /* Get interrupt events(tx/rx) */
         events = fsl_espi_read_reg(espi, ESPI_SPIE);
-       if (!events) {
+       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
+       if (!(events & mask)) {
                 spin_unlock(&espi->lock);
                 return IRQ_NONE;
         }

The SPIE register contains the TXCNT so events is pretty much always 
going to have something set. By checking events against what we've 
actually requested interrupts for we don't see any spurious events.

I've tested this on the T2080RDB and on our custom hardware and it seems 
to resolve the problem.


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
       [not found]                                             ` <CAFSsGVvRMQoEoBN1hpao_4uM1yF2wwuKPbMkAcwFWyE1X8HDbQ@mail.gmail.com>
@ 2020-09-06 21:03                                               ` Chris Packham
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Packham @ 2020-09-06 21:03 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Nicholas Piggin, benh, broonie, linux-kernel, linux-spi,
	linuxppc-dev, mpe, paulus

(resend as something decided to insert html, some context stripped)

On 5/09/20 5:23 am, Heiner Kallweit wrote:
> On Fri 4. Sep 2020 at 01:58, Chris Packham 
> <Chris.Packham@alliedtelesis.co.nz 
> <mailto:Chris.Packham@alliedtelesis.co.nz>> wrote:
>
>
<snip>
>
>
>     I tried ftrace but I really wasn't sure what I was looking for.
>
>     Capturing a "bad" case was pretty tricky. But I think I've
>     identified a
>
>     fix (I'll send it as a proper patch shortly). The gist is
>
>
>
>     diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
>
>     index 7e7c92cafdbb..cb120b68c0e2 100644
>
>     --- a/drivers/spi/spi-fsl-espi.c
>
>     +++ b/drivers/spi/spi-fsl-espi.c
>
>     @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
>
>     *espi, u32 events)
>
>       static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>
>       {
>
>              struct fsl_espi *espi = context_data;
>
>     -       u32 events;
>
>     +       u32 events, mask;
>
>
>
>              spin_lock(&espi->lock);
>
>
>
>              /* Get interrupt events(tx/rx) */
>
>              events = fsl_espi_read_reg(espi, ESPI_SPIE);
>
>     -       if (!events) {
>
>     +       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
>
>     +       if (!(events & mask)) {
>
>                      spin_unlock(&espi->lock);
>
>                      return IRQ_NONE;
>
>              }
>
>
>
>     The SPIE register contains the TXCNT so events is pretty much always
>
>     going to have something set. By checking events against what we've
>
>     actually requested interrupts for we don't see any spurious events.
>
>
> Usually we shouldn’t receive interrupts we’re not interested in, 
> except the interrupt is shared.
My theory is that we get an interrupt to the core for RXT and another 
for DON. With the changes to the powerpc interrupt handling and the fact 
that fsl_espi_cpu_irq() doesn't actually look at the specific event bits 
means that sometimes both events are handled in the processing of the 
first interrupt and the second one trips the SPI_DON not set error.

There's an old comment "SPI bus sometimes got lost interrupts..." which 
makes me wonder if the interrupt handling change has fixed this original 
problem.

I still think the change makes sense regardless because the SPIE 
register has some counter fields in it.

> This leads to the question: is the SPI interrupt shared with another 
> device on your system?
It's not shared on either the custom board or the T2080RDB.
> Do you see spurious interrupts with the patch under 
> /proc/irq/(irq)/spurious?

Yes it looks like it

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 3126
unhandled 0
last_unhandled 0 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 1016
unhandled 0
last_unhandled 4294746100 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 88391
unhandled 0
last_unhandled 4294746100 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 72459
unhandled 2
last_unhandled 4294758632 ms
>
>
>
>     I've tested this on the T2080RDB and on our custom hardware and it
>     seems
>
>     to resolve the problem.
>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-09-03 23:58                                           ` Chris Packham
       [not found]                                             ` <CAFSsGVvRMQoEoBN1hpao_4uM1yF2wwuKPbMkAcwFWyE1X8HDbQ@mail.gmail.com>
@ 2020-09-07  9:53                                             ` Joakim Tjernlund
  2020-09-07 15:38                                               ` Joakim Tjernlund
  1 sibling, 1 reply; 32+ messages in thread
From: Joakim Tjernlund @ 2020-09-07  9:53 UTC (permalink / raw)
  To: mpe, broonie, paulus, npiggin, Chris.Packham, benh, hkallweit1
  Cc: linuxppc-dev, linux-spi, linux-kernel

On Thu, 2020-09-03 at 23:58 +0000, Chris Packham wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
> 
> 
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> > Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> > > On 1/09/20 12:33 am, Heiner Kallweit wrote:
> > > > On 30.08.2020 23:59, Chris Packham wrote:
> > > > > On 31/08/20 9:41 am, Heiner Kallweit wrote:
> > > > > > On 30.08.2020 23:00, Chris Packham wrote:
> > > > > > > On 31/08/20 12:30 am, Nicholas Piggin wrote:
> > > > > > > > Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> > > > > > > <snip>
> > > > > > > 
> > > > > > > > > > > > > I've also now seen the RX FIFO not empty error on the T2080RDB
> > > > > > > > > > > > > 
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > 
> > > > > > > > > > > > > With my current workaround of emptying the RX FIFO. It seems
> > > > > > > > > > > > > survivable. Interestingly it only ever seems to be 1 extra byte in the
> > > > > > > > > > > > > RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 70
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > > 
> > > > > > > > > > > > >      From all the Micron SPI-NOR datasheets I've got access to it is
> > > > > > > > > > > > > possible to continually read the SR/FSR. But I've no idea why it
> > > > > > > > > > > > > happens some times and not others.
> > > > > > > > > > > > So I think I've got a reproduction and I think I've bisected the problem
> > > > > > > > > > > > to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
> > > > > > > > > > > > C"). My day is just finishing now so I haven't applied too much scrutiny
> > > > > > > > > > > > to this result. Given the various rabbit holes I've been down on this
> > > > > > > > > > > > issue already I'd take this information with a good degree of skepticism.
> > > > > > > > > > > > 
> > > > > > > > > > > OK, so an easy test should be to re-test with a 5.4 kernel.
> > > > > > > > > > > It doesn't have yet the change you're referring to, and the fsl-espi driver
> > > > > > > > > > > is basically the same as in 5.7 (just two small changes in 5.7).
> > > > > > > > > > There's 6cc0c16d82f88 and maybe also other interrupt related patches
> > > > > > > > > > around this time that could affect book E, so it's good if that exact
> > > > > > > > > > patch is confirmed.
> > > > > > > > > My confirmation is basically that I can induce the issue in a 5.4 kernel
> > > > > > > > > by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
> > > > > > > > > 5.9-rc2 by reverting that one commit.
> > > > > > > > > 
> > > > > > > > > I both cases it's not exactly a clean cherry-pick/revert so I also
> > > > > > > > > confirmed the bisection result by building at 3282a3da25bd (which sees
> > > > > > > > > the issue) and the commit just before (which does not).
> > > > > > > > Thanks for testing, that confirms it well.
> > > > > > > > 
> > > > > > > > [snip patch]
> > > > > > > > 
> > > > > > > > > I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
> > > > > > > > > didn't report anything (either with or without the change above).
> > > > > > > > Okay, it was a bit of a shot in the dark. I still can't see what
> > > > > > > > else has changed.
> > > > > > > > 
> > > > > > > > What would cause this, a lost interrupt? A spurious interrupt? Or
> > > > > > > > higher interrupt latency?
> > > > > > > > 
> > > > > > > > I don't think the patch should cause significantly worse latency,
> > > > > > > > (it's supposed to be a bit better if anything because it doesn't set
> > > > > > > > up the full interrupt frame). But it's possible.
> > > > > > > My working theory is that the SPI_DON indication is all about the TX
> > > > > > > direction an now that the interrupts are faster we're hitting an error
> > > > > > > because there is still RX activity going on. Heiner disagrees with my
> > > > > > > interpretation of the SPI_DON indication and the fact that it doesn't
> > > > > > > happen every time does throw doubt on it.
> > > > > > > 
> > > > > > It's right that the eSPI spec can be interpreted that SPI_DON refers to
> > > > > > TX only. However this wouldn't really make sense, because also for RX
> > > > > > we program the frame length, and therefore want to be notified once the
> > > > > > full frame was received. Also practical experience shows that SPI_DON
> > > > > > is set also after RX-only transfers.
> > > > > > Typical SPI NOR use case is that you write read command + start address,
> > > > > > followed by a longer read. If the TX-only interpretation would be right,
> > > > > > we'd always end up with SPI_DON not being set.
> > > > > > 
> > > > > > > I can't really explain the extra RX byte in the fifo. We know how many
> > > > > > > bytes to expect and we pull that many from the fifo so it's not as if
> > > > > > > we're missing an interrupt causing us to skip the last byte. I've been
> > > > > > > looking for some kind of off-by-one calculation but again if it were
> > > > > > > something like that it'd happen all the time.
> > > > > > > 
> > > > > > Maybe it helps to know what value this extra byte in the FIFO has. Is it:
> > > > > > - a duplicate of the last read byte
> > > > > > - or the next byte (at <end address> + 1)
> > > > > > - or a fixed value, e.g. always 0x00 or 0xff
> > > > > The values were up thread a bit but I'll repeat them here
> > > > > 
> > > > > fsl_espi ffe110000.spi: tx 70
> > > > > fsl_espi ffe110000.spi: rx 03
> > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > 
> > > > > 
> > > > > The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
> > > > > either a READ_SR or a READ_FSR. Never a data read.
> > > > > 
> > > > Just remembered something about SPIE_DON:
> > > > Transfers are always full duplex, therefore in case of a read the chip
> > > > sends dummy zero's. Having said that in case of a read SPIE_DON means
> > > > that the last dummy zero was shifted out.
> > > > 
> > > > READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> > > > So the issue may have a dependency on the length of the transfer.
> > > > However I see no good explanation so far. You can try adding a delay of
> > > > a few miroseconds between the following to commands in fsl_espi_bufs().
> > > > 
> > > >     fsl_espi_write_reg(espi, ESPI_SPIM, mask);
> > > > 
> > > >     /* Prevent filling the fifo from getting interrupted */
> > > >     spin_lock_irq(&espi->lock);
> > > > 
> > > > Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
> > > I think this might be heading in the right direction. Playing about with
> > > a delay does seem to make the two symptoms less likely. Although I have
> > > to set it quite high (i.e. msleep(100)) to completely avoid any
> > > possibility of seeing either message.
> > The patch might replay the interrupt a little bit faster, but it would
> > be a few microseconds at most I think (just from improved code).
> > 
> > Would you be able to ftrace the interrupt handler function and see if you
> > can see a difference in number or timing of interrupts? I'm at a bit of
> > a loss.
> 
> I tried ftrace but I really wasn't sure what I was looking for.
> Capturing a "bad" case was pretty tricky. But I think I've identified a
> fix (I'll send it as a proper patch shortly). The gist is
> 
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> index 7e7c92cafdbb..cb120b68c0e2 100644
> --- a/drivers/spi/spi-fsl-espi.c
> +++ b/drivers/spi/spi-fsl-espi.c
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
> *espi, u32 events)
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>   {
>          struct fsl_espi *espi = context_data;
> -       u32 events;
> +       u32 events, mask;
> 
>          spin_lock(&espi->lock);
> 
>          /* Get interrupt events(tx/rx) */
>          events = fsl_espi_read_reg(espi, ESPI_SPIE);
> -       if (!events) {
> +       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> +       if (!(events & mask)) {
>                  spin_unlock(&espi->lock);
>                  return IRQ_NONE;
>          }
> 
> The SPIE register contains the TXCNT so events is pretty much always
> going to have something set. By checking events against what we've
> actually requested interrupts for we don't see any spurious events.
> 
> I've tested this on the T2080RDB and on our custom hardware and it seems
> to resolve the problem.
> 

I looked at the fsl_espi_irq() too and noticed that clearing of the IRQ events
are after processing TX/RX. That looks a bit odd to me.

  Jocke

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: fsl_espi errors on v5.7.15
  2020-09-07  9:53                                             ` Joakim Tjernlund
@ 2020-09-07 15:38                                               ` Joakim Tjernlund
  0 siblings, 0 replies; 32+ messages in thread
From: Joakim Tjernlund @ 2020-09-07 15:38 UTC (permalink / raw)
  To: mpe, broonie, paulus, npiggin, Chris.Packham, benh, hkallweit1
  Cc: linuxppc-dev, linux-spi, linux-kernel

[SNIP]
> > 

> > > Would you be able to ftrace the interrupt handler function and see if you
> > > can see a difference in number or timing of interrupts? I'm at a bit of
> > > a loss.
> > 
> > I tried ftrace but I really wasn't sure what I was looking for.
> > Capturing a "bad" case was pretty tricky. But I think I've identified a
> > fix (I'll send it as a proper patch shortly). The gist is
> > 
> > diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> > index 7e7c92cafdbb..cb120b68c0e2 100644
> > --- a/drivers/spi/spi-fsl-espi.c
> > +++ b/drivers/spi/spi-fsl-espi.c
> > @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
> > *espi, u32 events)
> >   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
> >   {
> >          struct fsl_espi *espi = context_data;
> > -       u32 events;
> > +       u32 events, mask;
> > 
> >          spin_lock(&espi->lock);
> > 
> >          /* Get interrupt events(tx/rx) */
> >          events = fsl_espi_read_reg(espi, ESPI_SPIE);
> > -       if (!events) {
> > +       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> > +       if (!(events & mask)) {
> >                  spin_unlock(&espi->lock);
> >                  return IRQ_NONE;
> >          }
> > 
> > The SPIE register contains the TXCNT so events is pretty much always
> > going to have something set. By checking events against what we've
> > actually requested interrupts for we don't see any spurious events.
> > 
> > I've tested this on the T2080RDB and on our custom hardware and it seems
> > to resolve the problem.
> > 
> 
> I looked at the fsl_espi_irq() too and noticed that clearing of the IRQ events
> are after processing TX/RX. That looks a bit odd to me.

I should have been more specific. I think you can loose IRQs as fsl_espi_irq() works now.
Consider this:
1) You get TX IRQ and enter fsl_espi_irq()
2) Enter fsl_espi_fill_tx_fifo() to process any chars until done.
3) Now you get one more TX IRQ
4) fsl_espi_irq() clear events -> IRQ from 3) is lost.

 Jocke

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2020-09-07 15:43 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-14  2:48 fsl_espi errors on v5.7.15 Chris Packham
2020-08-14  6:19 ` Heiner Kallweit
2020-08-16 21:09   ` Chris Packham
2020-08-18 22:44     ` Chris Packham
2020-08-19  6:15       ` Heiner Kallweit
2020-08-19 21:08         ` Chris Packham
2020-08-24 22:04           ` Chris Packham
2020-08-25  3:54             ` Chris Packham
2020-08-25  7:22               ` Heiner Kallweit
2020-08-25 22:22                 ` Chris Packham
2020-08-26  1:48                   ` Chris Packham
2020-08-26  6:07                     ` Chris Packham
2020-08-26  6:38                       ` Heiner Kallweit
2020-08-26 22:26                         ` Chris Packham
2020-08-27  7:12                         ` Nicholas Piggin
2020-08-27 22:07                           ` Chris Packham
2020-08-30 12:30                             ` Nicholas Piggin
2020-08-30 13:55                               ` Heiner Kallweit
2020-08-30 21:00                               ` Chris Packham
2020-08-30 21:41                                 ` Heiner Kallweit
2020-08-30 21:59                                   ` Chris Packham
2020-08-31 12:33                                     ` Heiner Kallweit
2020-09-01  1:25                                       ` Chris Packham
2020-09-01  6:14                                         ` Nicholas Piggin
2020-09-01 23:29                                           ` Chris Packham
2020-09-01 23:31                                             ` Chris Packham
2020-09-03 23:58                                           ` Chris Packham
     [not found]                                             ` <CAFSsGVvRMQoEoBN1hpao_4uM1yF2wwuKPbMkAcwFWyE1X8HDbQ@mail.gmail.com>
2020-09-06 21:03                                               ` Chris Packham
2020-09-07  9:53                                             ` Joakim Tjernlund
2020-09-07 15:38                                               ` Joakim Tjernlund
2020-09-01  2:52                                       ` Chris Packham
2020-09-01  7:38                                         ` Heiner Kallweit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).