All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
To: Heiner Kallweit <hkallweit1@gmail.com>,
	"broonie@kernel.org" <broonie@kernel.org>,
	"mpe@ellerman.id.au" <mpe@ellerman.id.au>,
	"benh@kernel.crashing.org" <benh@kernel.crashing.org>,
	"paulus@samba.org" <paulus@samba.org>
Cc: "linux-spi@vger.kernel.org" <linux-spi@vger.kernel.org>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: fsl_espi errors on v5.7.15
Date: Wed, 26 Aug 2020 06:07:18 +0000	[thread overview]
Message-ID: <42107721-614b-96e8-68d9-4b888206562e@alliedtelesis.co.nz> (raw)
In-Reply-To: <519c3068-6c73-c17a-2016-1afe2a1d12f7@alliedtelesis.co.nz>


On 26/08/20 1:48 pm, Chris Packham wrote:
>
> On 26/08/20 10:22 am, Chris Packham wrote:
>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>
>> <snip>
>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>> identified a couple of deficiencies that may or may not be related 
>>>> to my
>>>> issue.
>>>>
>>>> First I think the 'Transfer done but SPIE_DON isn't set' message 
>>>> can be
>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE 
>>>> register.
>>>> We also write back to it to clear the current events. We re-read it in
>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>> naturally end up in that situation if we're doing a large read. 
>>>> Consider
>>>> the messages for reading a block of data from a spi-nor chip
>>>>
>>>>    tx = READ_OP + ADDR
>>>>    rx = data
>>>>
>>>> We setup the transfer and pump out the tx_buf. The first interrupt 
>>>> goes
>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>> continues until we've received all the data and we finish with 
>>>> ESPI_SPIE
>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>> isn't set.
>>>>
>>>> The other deficiency is that we only get an interrupt when the 
>>>> amount of
>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>>> the fifo.
>>>>
>>> SPIM_DON will trigger an interrupt once the last characters have been
>>> transferred, and read the remaining characters from the FIFO.
>>
>> The T2080RM that I have says the following about the DON bit
>>
>> "Last character was transmitted. The last character was transmitted 
>> and a new command can be written for the next frame."
>>
>> That does at least seem to fit with my assertion that it's all about 
>> the TX direction. But the fact that it doesn't happen all the time 
>> throws some doubt on it.
>>
>>> I think the reason I'm seeing some variability is because of how fast
>>>> (or slow) the interrupts get processed and how fast the spi-nor 
>>>> chip can
>>>> fill the CPUs rx fifo.
>>>>
>>> To rule out timing issues at high bus frequencies I initially asked
>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>> or even less, then timing shouldn't be an issue.
>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
>> would also expect a slower frequency would fit my "DON is for TX" 
>> narrative.
>>> Last relevant functional changes have been done almost 4 years ago.
>>> And yours is the first such report I see. So question is what could 
>>> be so
>>> special with your setup that it seems you're the only one being 
>>> affected.
>>> The scenarios you describe are standard, therefore much more people
>>> should be affected in case of a driver bug.
>> Agreed. But even on my hardware (which may have a latent issue 
>> despite being in the field for going on 5 years) the issue only 
>> triggers under some fairly specific circumstances.
>>> You said that kernel config impacts how frequently the issue happens.
>>> Therefore question is what's the diff in kernel config, and how could
>>> the differences be related to SPI.
>>
>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
>> impact but every time I found something that seemed to be having an 
>> impact I've been able to disprove it. I actually think its about how 
>> busy the system is which may or may not affect when we get round to 
>> processing the interrupts.
>>
>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
>> occur on the T2080RDB.
>>
>> I've had to add the following to expose the environment as a mtd 
>> partition
>>
>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> index ff87e67c70da..fbf95fc1fd68 100644
>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> @@ -116,6 +116,15 @@ flash@0 {
>>                                 compatible = "micron,n25q512ax3", 
>> "jedec,spi-nor";
>>                                 reg = <0>;
>>                                 spi-max-frequency = <10000000>; /* 
>> input clock */
>> +
>> +                               partition@u-boot {
>> +                                        reg = <0x00000000 0x00100000>;
>> +                                        label = "u-boot";
>> +                                };
>> +                                partition@u-boot-env {
>> +                                        reg = <0x00100000 0x00010000>;
>> +                                        label = "u-boot-env";
>> +                                };
>>                         };
>>                 };
>>
>> And I'm using the following script to poke at the environment 
>> (warning if anyone does try this and the bug hits it can render your 
>> u-boot environment invalid).
>>
>> cat flash/fw_env_test.sh
>> #!/bin/sh
>>
>> generate_fw_env_config()
>> {
>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize 
>> name ; do
>>      echo "$dev $size $erasesize $name"
>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
>> $erasesize" >/flash/fw_env.config
>>   done
>> }
>>
>> cycles=10
>> [ $# -ge 1 ] && cycles=$1
>>
>> generate_fw_env_config
>>
>> fw_printenv -c /flash/fw_env.config
>>
>> dmesg -c >/dev/null
>> x=0
>> while [ $x -lt $cycles ]; do
>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>     dmesg -c | grep -q fsl_espi && break;
>>     let x=x+1
>> done
>>
>> echo "Ran $x cycles"
>
> I've also now seen the RX FIFO not empty error on the T2080RDB
>
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> With my current workaround of emptying the RX FIFO. It seems 
> survivable. Interestingly it only ever seems to be 1 extra byte in the 
> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>
> fsl_espi ffe110000.spi: tx 70
> fsl_espi ffe110000.spi: rx 03
> fsl_espi ffe110000.spi: Extra RX 00
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
>
> From all the Micron SPI-NOR datasheets I've got access to it is 
> possible to continually read the SR/FSR. But I've no idea why it 
> happens some times and not others.

So I think I've got a reproduction and I think I've bisected the problem 
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in 
C"). My day is just finishing now so I haven't applied too much scrutiny 
to this result. Given the various rabbit holes I've been down on this 
issue already I'd take this information with a good degree of skepticism.

Thanks,
Chris

WARNING: multiple messages have this Message-ID (diff)
From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
To: Heiner Kallweit <hkallweit1@gmail.com>,
	"broonie@kernel.org" <broonie@kernel.org>,
	"mpe@ellerman.id.au" <mpe@ellerman.id.au>,
	"benh@kernel.crashing.org" <benh@kernel.crashing.org>,
	"paulus@samba.org" <paulus@samba.org>
Cc: "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-spi@vger.kernel.org" <linux-spi@vger.kernel.org>
Subject: Re: fsl_espi errors on v5.7.15
Date: Wed, 26 Aug 2020 06:07:18 +0000	[thread overview]
Message-ID: <42107721-614b-96e8-68d9-4b888206562e@alliedtelesis.co.nz> (raw)
In-Reply-To: <519c3068-6c73-c17a-2016-1afe2a1d12f7@alliedtelesis.co.nz>


On 26/08/20 1:48 pm, Chris Packham wrote:
>
> On 26/08/20 10:22 am, Chris Packham wrote:
>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>
>> <snip>
>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>> identified a couple of deficiencies that may or may not be related 
>>>> to my
>>>> issue.
>>>>
>>>> First I think the 'Transfer done but SPIE_DON isn't set' message 
>>>> can be
>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE 
>>>> register.
>>>> We also write back to it to clear the current events. We re-read it in
>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>> naturally end up in that situation if we're doing a large read. 
>>>> Consider
>>>> the messages for reading a block of data from a spi-nor chip
>>>>
>>>>    tx = READ_OP + ADDR
>>>>    rx = data
>>>>
>>>> We setup the transfer and pump out the tx_buf. The first interrupt 
>>>> goes
>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>> continues until we've received all the data and we finish with 
>>>> ESPI_SPIE
>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>> isn't set.
>>>>
>>>> The other deficiency is that we only get an interrupt when the 
>>>> amount of
>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>>> the fifo.
>>>>
>>> SPIM_DON will trigger an interrupt once the last characters have been
>>> transferred, and read the remaining characters from the FIFO.
>>
>> The T2080RM that I have says the following about the DON bit
>>
>> "Last character was transmitted. The last character was transmitted 
>> and a new command can be written for the next frame."
>>
>> That does at least seem to fit with my assertion that it's all about 
>> the TX direction. But the fact that it doesn't happen all the time 
>> throws some doubt on it.
>>
>>> I think the reason I'm seeing some variability is because of how fast
>>>> (or slow) the interrupts get processed and how fast the spi-nor 
>>>> chip can
>>>> fill the CPUs rx fifo.
>>>>
>>> To rule out timing issues at high bus frequencies I initially asked
>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>> or even less, then timing shouldn't be an issue.
>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
>> would also expect a slower frequency would fit my "DON is for TX" 
>> narrative.
>>> Last relevant functional changes have been done almost 4 years ago.
>>> And yours is the first such report I see. So question is what could 
>>> be so
>>> special with your setup that it seems you're the only one being 
>>> affected.
>>> The scenarios you describe are standard, therefore much more people
>>> should be affected in case of a driver bug.
>> Agreed. But even on my hardware (which may have a latent issue 
>> despite being in the field for going on 5 years) the issue only 
>> triggers under some fairly specific circumstances.
>>> You said that kernel config impacts how frequently the issue happens.
>>> Therefore question is what's the diff in kernel config, and how could
>>> the differences be related to SPI.
>>
>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
>> impact but every time I found something that seemed to be having an 
>> impact I've been able to disprove it. I actually think its about how 
>> busy the system is which may or may not affect when we get round to 
>> processing the interrupts.
>>
>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
>> occur on the T2080RDB.
>>
>> I've had to add the following to expose the environment as a mtd 
>> partition
>>
>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> index ff87e67c70da..fbf95fc1fd68 100644
>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> @@ -116,6 +116,15 @@ flash@0 {
>>                                 compatible = "micron,n25q512ax3", 
>> "jedec,spi-nor";
>>                                 reg = <0>;
>>                                 spi-max-frequency = <10000000>; /* 
>> input clock */
>> +
>> +                               partition@u-boot {
>> +                                        reg = <0x00000000 0x00100000>;
>> +                                        label = "u-boot";
>> +                                };
>> +                                partition@u-boot-env {
>> +                                        reg = <0x00100000 0x00010000>;
>> +                                        label = "u-boot-env";
>> +                                };
>>                         };
>>                 };
>>
>> And I'm using the following script to poke at the environment 
>> (warning if anyone does try this and the bug hits it can render your 
>> u-boot environment invalid).
>>
>> cat flash/fw_env_test.sh
>> #!/bin/sh
>>
>> generate_fw_env_config()
>> {
>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize 
>> name ; do
>>      echo "$dev $size $erasesize $name"
>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
>> $erasesize" >/flash/fw_env.config
>>   done
>> }
>>
>> cycles=10
>> [ $# -ge 1 ] && cycles=$1
>>
>> generate_fw_env_config
>>
>> fw_printenv -c /flash/fw_env.config
>>
>> dmesg -c >/dev/null
>> x=0
>> while [ $x -lt $cycles ]; do
>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>     dmesg -c | grep -q fsl_espi && break;
>>     let x=x+1
>> done
>>
>> echo "Ran $x cycles"
>
> I've also now seen the RX FIFO not empty error on the T2080RDB
>
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> With my current workaround of emptying the RX FIFO. It seems 
> survivable. Interestingly it only ever seems to be 1 extra byte in the 
> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>
> fsl_espi ffe110000.spi: tx 70
> fsl_espi ffe110000.spi: rx 03
> fsl_espi ffe110000.spi: Extra RX 00
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
>
> From all the Micron SPI-NOR datasheets I've got access to it is 
> possible to continually read the SR/FSR. But I've no idea why it 
> happens some times and not others.

So I think I've got a reproduction and I think I've bisected the problem 
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in 
C"). My day is just finishing now so I haven't applied too much scrutiny 
to this result. Given the various rabbit holes I've been down on this 
issue already I'd take this information with a good degree of skepticism.

Thanks,
Chris

  reply	other threads:[~2020-08-26  6:07 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-14  2:48 fsl_espi errors on v5.7.15 Chris Packham
2020-08-14  2:48 ` Chris Packham
2020-08-14  6:19 ` Heiner Kallweit
2020-08-14  6:19   ` Heiner Kallweit
2020-08-16 21:09   ` Chris Packham
2020-08-16 21:09     ` Chris Packham
2020-08-18 22:44     ` Chris Packham
2020-08-18 22:44       ` Chris Packham
2020-08-19  6:15       ` Heiner Kallweit
2020-08-19  6:15         ` Heiner Kallweit
2020-08-19 21:08         ` Chris Packham
2020-08-19 21:08           ` Chris Packham
2020-08-24 22:04           ` Chris Packham
2020-08-24 22:04             ` Chris Packham
2020-08-25  3:54             ` Chris Packham
2020-08-25  3:54               ` Chris Packham
2020-08-25  7:22               ` Heiner Kallweit
2020-08-25  7:22                 ` Heiner Kallweit
2020-08-25 22:22                 ` Chris Packham
2020-08-25 22:22                   ` Chris Packham
2020-08-26  1:48                   ` Chris Packham
2020-08-26  1:48                     ` Chris Packham
2020-08-26  6:07                     ` Chris Packham [this message]
2020-08-26  6:07                       ` Chris Packham
2020-08-26  6:38                       ` Heiner Kallweit
2020-08-26  6:38                         ` Heiner Kallweit
2020-08-26 22:26                         ` Chris Packham
2020-08-26 22:26                           ` Chris Packham
2020-08-27  7:12                         ` Nicholas Piggin
2020-08-27  7:12                           ` Nicholas Piggin
2020-08-27 22:07                           ` Chris Packham
2020-08-27 22:07                             ` Chris Packham
2020-08-30 12:30                             ` Nicholas Piggin
2020-08-30 12:30                               ` Nicholas Piggin
2020-08-30 13:55                               ` Heiner Kallweit
2020-08-30 13:55                                 ` Heiner Kallweit
2020-08-30 21:00                               ` Chris Packham
2020-08-30 21:00                                 ` Chris Packham
2020-08-30 21:41                                 ` Heiner Kallweit
2020-08-30 21:41                                   ` Heiner Kallweit
2020-08-30 21:59                                   ` Chris Packham
2020-08-30 21:59                                     ` Chris Packham
2020-08-31 12:33                                     ` Heiner Kallweit
2020-08-31 12:33                                       ` Heiner Kallweit
2020-09-01  1:25                                       ` Chris Packham
2020-09-01  1:25                                         ` Chris Packham
2020-09-01  6:14                                         ` Nicholas Piggin
2020-09-01  6:14                                           ` Nicholas Piggin
2020-09-01 23:29                                           ` Chris Packham
2020-09-01 23:29                                             ` Chris Packham
2020-09-01 23:31                                             ` Chris Packham
2020-09-01 23:31                                               ` Chris Packham
2020-09-03 23:58                                           ` Chris Packham
2020-09-03 23:58                                             ` Chris Packham
2020-09-04 17:23                                             ` Heiner Kallweit
2020-09-06 20:59                                               ` Chris Packham
2020-09-06 21:03                                               ` Chris Packham
2020-09-06 21:03                                                 ` Chris Packham
2020-09-07  9:53                                             ` Joakim Tjernlund
2020-09-07  9:53                                               ` Joakim Tjernlund
2020-09-07 15:38                                               ` Joakim Tjernlund
2020-09-07 15:38                                                 ` Joakim Tjernlund
2020-09-01  2:52                                       ` Chris Packham
2020-09-01  2:52                                         ` Chris Packham
2020-09-01  7:38                                         ` Heiner Kallweit
2020-09-01  7:38                                           ` Heiner Kallweit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=42107721-614b-96e8-68d9-4b888206562e@alliedtelesis.co.nz \
    --to=chris.packham@alliedtelesis.co.nz \
    --cc=benh@kernel.crashing.org \
    --cc=broonie@kernel.org \
    --cc=hkallweit1@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-spi@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=paulus@samba.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.