linux-kernel.vger.kernel.org archive mirror
From: Konrad Dybcio <konrad.dybcio@somainline.org>
To: Sricharan Ramabadhran <sricharan@codeaurora.org>,
	Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Manivannan Sadhasivam <mani@kernel.org>,
	pragalla@codeaurora.org, ~postmarketos/upstreaming@lists.sr.ht,
	martin.botka@somainline.org,
	angelogioacchino.delregno@somainline.org,
	marijn.suijten@somainline.org, jamipkettunen@somainline.org,
	Richard Weinberger <richard@nod.at>,
	Vignesh Raghavendra <vigneshr@ti.com>,
	linux-mtd@lists.infradead.org, linux-arm-msm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mdalam@codeaurora.org,
	bbhatt@codeaurora.org, hemantk@codeaurora.org
Subject: Re: [PATCH] mtd: nand: raw: qcom_nandc: Don't clear_bam_transaction on READID
Date: Tue, 8 Feb 2022 17:45:48 +0100
Message-ID: <6b839237-74f0-7270-2f33-f5c17e6b59de@somainline.org>
In-Reply-To: <cc1302f4-9150-0145-421c-bf2b7a7bf258@codeaurora.org>



On 4.02.2022 18:17, Sricharan Ramabadhran wrote:
> 
> On 2/2/2022 12:54 PM, Sricharan Ramabadhran wrote:
>> Hi Konrad/Miquel,
>>
>> On 2/1/2022 9:21 PM, Konrad Dybcio wrote:
>>>
>>> On 01/02/2022 14:52, Miquel Raynal wrote:
>>>> Hi Konrad,
>>>>
>>>> konrad.dybcio@somainline.org wrote on Mon, 31 Jan 2022 20:54:12 +0100:
>>>>
>>>>> On 31/01/2022 15:13, Sricharan Ramabadhran wrote:
>>>>>> Hi Konrad,
>>>>>>
>>>>>> On 1/31/2022 3:39 PM, Konrad Dybcio wrote:
>>>>>>> On 28/01/2022 18:50, Sricharan Ramabadhran wrote:
>>>>>>>> Hi Konrad,
>>>>>>>>
>>>>>>>> On 1/28/2022 9:55 AM, Sricharan Ramabadhran wrote:
>>>>>>>>> Hi Miquel,
>>>>>>>>>
>>>>>>>>> On 1/26/2022 4:12 PM, Miquel Raynal wrote:
>>>>>>>>>> Hi Mani,
>>>>>>>>>>
>>>>>>>>>> mani@kernel.org wrote on Wed, 26 Jan 2022 16:03:16 +0530:
>>>>>>>>>>> On Wed, Jan 26, 2022 at 11:16:13AM +0100, Miquel Raynal wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> miquel.raynal@bootlin.com wrote on Fri, 14 Jan 2022 08:27:18 +0100:
>>>>>>>>>>>>> Hi Konrad,
>>>>>>>>>>>>>
>>>>>>>>>>>>> konrad.dybcio@somainline.org wrote on Thu, 13 Jan 2022 19:44:26 +0100:
>>>>>>>>>>>>>> While I have absolutely 0 idea why and how, running clear_bam_transaction
>>>>>>>>>>>>>> when READID is issued makes the DMA totally clog up and refuse to function
>>>>>>>>>>>>>> at all on mdm9607. In fact, it is so bad that all the data gets garbled
>>>>>>>>>>>>>> and after a short while in the nand probe flow, the CPU decides that
>>>>>>>>>>>>>> seppuku is the only option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Removing _READID from the if condition makes it work like a charm, I can
>>>>>>>>>>>>>> read data and mount partitions without a problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Konrad Dybcio <konrad.dybcio@somainline.org>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> This is totally just an observation which took me an inhumane amount of
>>>>>>>>>>>>>> debug prints to find.. perhaps there's a better reason behind this, but
>>>>>>>>>>>>>> I can't seem to find any answers.. Therefore, this is a BIG RFC!
>>>>>>>>>>>>> I'm adding two people from codeaurora who worked a lot on this driver.
>>>>>>>>>>>>> Hopefully they will have an idea :)
>>>>>>>>>>>> Sadre, I've spent a significant amount of time reviewing your patches,
>>>>>>>>>>>> now it's your turn to not take a month to answer to your peers'
>>>>>>>>>>>> proposals.
>>>>>>>>>>>>
>>>>>>>>>>>> Please help reviewing this patch.
>>>>>>>>>>> Sorry. I was hoping that Qcom folks would chime in as I don't have any idea
>>>>>>>>>>> about the mdm9607 platform. It could be that the mail server migration from
>>>>>>>>>>> codeaurora to quicinc put a barrier here.
>>>>>>>>>>>
>>>>>>>>>>> Let me ping them internally.
>>>>>>>>>> Oh, ok, I didn't know. Thanks!
>>>>>>>>>     Sorry Miquel, somehow we did not get this email in our inbox.
>>>>>>>>>     Thanks to Mani for pinging us, we will test this today and get back to you.
>>>>>>>>        We could not reproduce this issue on our IPQ boards (we do not have an mdm9607 right now),
>>>>>>>>        and the cause is not obvious.
>>>>>>>>        Can you please share the debug logs you collected for the above, stage by stage?
>>>>>>> I won't have access to the board for about two weeks, sorry.
>>>>>>>
>>>>>>> When I get to it, I'll surely try to send you the logs, though there
>>>>>>> wasn't much more than just something jumping to who-knows-where
>>>>>>> after clear_bam_transaction was called, resulting in values associated with
>>>>>>> the NAND being all zeroed out in pr_err/_debug/etc.
>>>>>>>
>>>>>>      Ok sure. So was the READID command itself failing, or the subsequent one?
>>>>>>     We can check which parameter reset by clear_bam_transaction is causing the
>>>>>>     failure. Meanwhile, looping in Pradeep, who has access to the board and so is in a better
>>>>>>     position to debug.
>>>>> I'm sorry I have so few details on hand, and no kernel tree (no access to that machine either, for now).
>>>>>
>>>>> I will try to describe to the best of my abilities what I recall.
>>>>>
>>>>> My methodology of making sure things don't go haywire was to print the oob size
>>>>> of our NAND basically every two lines of code (yes, I was very desperate at one point),
>>>>> as that was zeroed out when *the bug* happened,
>>>> This does look like a pointer error at some point and some kernel data
>>>> has been corrupted very badly by the driver.
>>>>
>>>>> leading to a kernel bug/panic/stall
>>>>> (can't recall what exactly it was, but it said something along the lines of "no support for
>>>>> oob size 0" and then it didn't fail gracefully, leading to some bad jumps and ultimately
>>>>> a dead platform..)
>>>>>
>>>>> after hours of digging, I found out that everything goes fine until clear_bam_transaction is called,
>>>> Do you remember if this function was called for the first time when
>>>> this happened?
>>>
>>> I think so, if I recall correctly there are no more callers in this path, as readid is the first nand command executed in flash probe flow.
>>>
>>>
>>>
>>>>
>>>>> after that gets executed every nand op starts reading all zeroes (for example in JEDEC ID check)
>>>>> so I added the changes from this patch, and things magically started working... My suspicion is
>>>>> that the underlying FIFO isn't fully drained (is it a FIFO on 9607? bah, I work on too many SoCs at once)
>>>> I don't see it in the list of supported devices, what's the exact
>>>> compatible used?
>>>
>>> qcom,ipq4019-nand
>>>
>>>
>>>
>>>>
>>>>> and this function only makes Linux think it is, without actually draining it, and the leftover
>>>>> commands get executed with some parts of them getting overwritten, resulting in the
>>>>> famous garbage in - garbage out situation, but that's only a guesstimate..
>>>> I would bet on a non-allocated bam-ish pointer that is reset to zero
>>>> in the clear_bam_transaction() helper.
>>>>
>>>> Can you get your hands on the board again?
>>>
>>> Sure, but as I mentioned previously, only in about 2 weeks, I can't really do any dev before then.. :(
>>>
>>>
>>>
>>>> It would be nice to check if the allocation always occurs before use,
>>>> and if so, for how many bytes.
>>>>
>>>> If the pointer is not dangling, then perhaps something else smashes
>>>> that pointer.
>>>
>>>
>>> Konrad
>>>
>>>>
>>>>> Do note this somehow worked fine on 5.11 and then broke on 5.12/13. I went as far as replacing most
>>>>> of the kernel with the updated/downgraded parts via git checkout (I tried many combinations),
>>>>> to no avail.. I even tried different compilers and optimization levels, thinking it could have been
>>>>> a codegen issue, but no luck either.
>>>>>
>>>>> I.. do understand this email is a total mess to read, as much as it was to write, but
>>>>> without access to my code and the machine itself I can't give you solid details, and
>>>>> the fact this situation is far from ordinary doesn't help either..
>>>>>
>>>>> The latest (ancient, not quite pretty, but probably working if my memory is correct) version of my patches
>>>>> for the mdm9607 is available at [1], I will push the new revision after I get access to the workstation.
>>>>>
>>   + few more who have access to the board.
>>
>>    Going by the description of kernel corruption, we can try out a KASAN build.
>>    Since you mentioned it worked until 5.11, did you bisect the driver up to the 5.11 head and confirm it worked?
>>
>    We tried running a KASAN-enabled image on an IPQ board, but no luck; nothing showed up.
>    We can only proceed if someone with the board can help here.
> 
> 
> Regards,
>   Sricharan
> 
I have the board with me again. Please tell me where we should start :)

Konrad
