All of lore.kernel.org
 help / color / mirror / Atom feed
From: <Tudor.Ambarus@microchip.com>
To: <peda@axentia.se>, <saravanak@google.com>
Cc: <alexandre.belloni@bootlin.com>, <gregkh@linuxfoundation.org>,
	<linux-kernel@vger.kernel.org>, <du@axentia.se>,
	<Ludovic.Desroches@microchip.com>,
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: Regression: memory corruption on Atmel SAMA5D31
Date: Mon, 7 Mar 2022 09:45:18 +0000	[thread overview]
Message-ID: <e47c155a-f25d-11b3-3339-b2bf71b886ce@microchip.com> (raw)
In-Reply-To: <3beb80ad-d6f8-8c6a-e17d-e40a644bba07@microchip.com>

On 3/4/22 18:48, Tudor.Ambarus@microchip.com wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On 3/4/22 14:38, Peter Rosin wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> Hi!
> 
> Hi, Peter!
> 
>>
>> On 2022-03-04 12:12, Tudor.Ambarus@microchip.com wrote:
>>> Hi, Peter!
>>>
>>> On 3/4/22 12:57, Peter Rosin wrote:
>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>
>>>> On 2022-03-04 07:57, Peter Rosin wrote:
>>>>> On 2022-03-04 04:55, Saravana Kannan wrote:
>>>>>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <peda@axentia.se> wrote:
>>>>>>>
>>>>>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>>>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <peda@axentia.se> wrote:
>>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>>>>>> things to try in order to track down what's going on. I have
>>>>>>>>> bisected the issue to
>>>>>>>>>
>>>>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>>>
>>>>>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>>>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>>>>>> Asking because of:
>>>>>>>> https://lore.kernel.org/lkml/20210930085714.2057460-1-yangyingliang@huawei.com/
>>>>>>>>
>>>>>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>>>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>>>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>>>>>
>>>>>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>>>>>> shows a difference driver issues would start becoming a real
>>>>>>>> possibility.
>>>>>>>>
>>>>>>>> -Saravana
>>>>>>>
>>>>>>> Thanks for the quick reply! I don't think I tested the very tip of
>>>>>>> Linus tree before, only latest rc or something like that, but now I
>>>>>>> have. I.e.
>>>>>>>
>>>>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>>>>
>>>>>>> It would have been typical if an issue that existed for a couple of
>>>>>>> years had been fixed the last few weeks, but alas, no.
>>>>>>>
>>>>>>> On that kernel, and with whatever the default fw_devlink value is, the
>>>>>>
>>>>>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>>>>>
>>>>>>> issue is there. It's a bit hard to tell if the incident probability
>>>>>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>>>>>> do not have to wait for long to get a bad hash with the first
>>>>>>> reproducer
>>>>>>>
>>>>>>>    while :; do cat testfile | sha256sum; done
>>>>>>>
>>>>>>> The output is typical:
>>>>>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215  -
>>>>>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570  -
>>>>>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3  -
>>>>>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645  -
>>>>>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>>
>>>>>>> Setting fw_devlink=off makes no difference, AFAICT.
>>>>>>
>>>>>> By this, I'm assuming you set fw_devlink=off in the kernel command
>>>>>> line and you still saw the corruption.
>>>>>
>>>>> Yes. On a bad kernel it's the same with all of the following kernel
>>>>> command lines.
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>>> If that's the case, I can't see how this could possibly have anything
>>>>>> to do with:
>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>
>>>>>> If you look at fw_devlink_link_device(), you'll see that the function
>>>>>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>>>>>> there, the rest of the code in the series doesn't run because more
>>>>>> fields wouldn't get set, etc. That pretty much disables ALL the code
>>>>>> in the entire series. The only remaining diff would be header file
>>>>>> changes where I add/remove fields. But that's unlikely to cause any
>>>>>> issues here because I'm either deleting fields that aren't used or
>>>>>> adding fields that won't be used (with fw_devlink=off). I think the
>>>>>> patch was just causing enough timing changes that it's masking the
>>>>>> real issue.
>>>>>
>>>>> When I compare fw_devlink_link_device() from before and after
>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>> I notice that you also removed an unconditional call to
>>>>> device_link_add_missing_supplier_links() that was live before,
>>>>> regardless of any fw_devlink parameter.
>>>>>
>>>>> I don't know if that's relevant. Is it?
>>>>>
>>>>> Not knowing this code at all, and without any serious attempt
>>>>> at reading it, from here the comment of that removed function
>>>>> sure looks like it might cause a different ordering before and
>>>>> after the patch that is not restored with any fw_devlink
>>>>> argument.
>>>>
>>>> It appears that the device_link_add_missing_supplier_links() difference
>>>> is not relevant after all. What actually happened in the header file in
>>>> the "bad" commit was that two fields were removed (none added). Like so:
>>>>
>>>>  struct dev_links_info {
>>>>         struct list_head suppliers;
>>>>         struct list_head consumers;
>>>> -       struct list_head needs_suppliers;
>>>>         struct list_head defer_sync;
>>>> -       bool need_for_probe;
>>>>         enum dl_dev_state status;
>>>>  };
>>>>
>>>> If I restore those fields on a bad kernel, the issue is no longer
>>>> visible. That is true for the first bad kernel, i.e.
>>>>
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>
>>>> and for tip of Linus as of recently, i.e.
>>>>
>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>
>>>> Which is of course insane and a whole different level of bad. WTF!?!
>>>>
>>>> I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
>>>> there? I think that's next on the list...
>>>>
>>>
>>> I have a sama5d3_xplained that uses a SAMA5D36 and has a 256MBytes DDR2 and a
>>> 256MBytes NAND Flash. I tried a test with a 200MB file, rootfs on sdcard and
>>> I couldn't reproduce the bug. I'm using Linus's latest kernel:
>>> 38f80f42147f (HEAD, origin/master, origin/HEAD) MAINTAINERS: Remove dead patchwork link
>>>
>>> root@sama5d3-xplained-sd:~# dd if=/dev/urandom of=testfile bs=1024 count=200000
>>> 200000+0 records in
>>> 200000+0 records out
>>> 204800000 bytes (205 MB, 195 MiB) copied, 37.6424 s, 5.4 MB/s
>>> root@sama5d3-xplained-sd:~# for i in 1 2 3 4 5 6 7 8; do cat testfile | sha256sum; done
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> root@sama5d3-xplained-sd:~#
>>>
>>> I'll put the rootfs on NAND and try to retest. Maybe to do some other tests
>>> in parallel to have more interrupts on the system. Will let you know if I can
>>> reproduce the bug on sama5d3_xplained.
>>
>> Thanks for testing!
> 
> you're welcome, no worries.
>>
>> Since you (probably) don't have the interrupt source from the USB
>> serial chip that I have, that is not completely unexpected.
>>
>> $ lsusb
>> Bus 001 Device 002: ID 0403:6011 Future Technology Devices International, Ltd FT4232H Quad HS USB-UART/FIFO IC
>> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
>> Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
>> $ cat /sys/bus/usb-serial/devices/ttyUSB?/latency_timer
>> 1
>> 1
>> 1
>> 1
>>
>> Also, your file is perhaps too small? You leave approx 50MB for the
>> system, so it might be the case that the page cache can hold the whole
>> file?
>>
>> So, can you please try that again with a slightly bigger file or if you
>> restrict how much RAM you allow the kernel to see?
>>
>> And if you don't have the FTDI usb-serial chip, you should probably go
>> with the other reproducer, namely to simply copy the random file to a
>> different host using scp.
> 
> I kept the rootfs on sdcard but this time I generated a 300MB random file.
> I ran a mtd_stresstest on the NAND flash while doing the sha256sum or scp
> tests. All went fine.
> 
> Here's the mtd_stresstest being successful https://pastebin.com/eWQNHAsE
> While the stresstest was running I did the following sha256 and scp tests:
> https://pastebin.com/wjutw63C
> 
> On my laptop the sha256sum is matching the one on the board:
> $ sha256sum /tmp/testfile?
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile1
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile2
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile3
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile4
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile5
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile6
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile7
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile8
> 
> Here's what "top" cmd was showing when doing the scp and the mtd_stresstest:
> top - 14:40:13 up 39 min,  3 users,  load average: 1.95, 1.88, 1.80
> Tasks:  54 total,   3 running,  51 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 35.1 us, 48.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi, 16.9 si,  0.0 st
> MiB Mem :    242.3 total,      2.5 free,     15.2 used,    224.6 buff/cache
> MiB Swap:      0.0 total,      0.0 free,      0.0 used.    220.1 avail Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   464 root      20   0    4296   3292   2940 R  46.6   1.3   0:17.53 ssh
>   401 root      20   0    1668    760    676 R  45.0   0.3  17:57.11 modprobe
>   463 root      20   0    3456   2232   2000 S   5.2   0.9   0:02.04 scp
> 
> Here's what "top" cmd was showing when doing the sha256sum and the mtd_stresstest:
> top - 14:12:47 up 12 min,  3 users,  load average: 2.14, 1.92, 1.08
> Tasks:  54 total,   3 running,  51 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 37.4 us, 58.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  4.2 si,  0.0 st
> MiB Mem :    242.3 total,      3.0 free,     14.8 used,    224.5 buff/cache
> MiB Swap:      0.0 total,      0.0 free,      0.0 used.    220.6 avail Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   420 root      20   0    1396    784    692 R  47.2   0.3   0:06.42 sha256sum
>   401 root      20   0    1668   1208   1124 R  43.0   0.5   4:50.34 modprobe
>   419 root      20   0    1520    868    680 S   6.5   0.3   0:00.92 cat
> 
> Peter, do you think it is worth to do some other tests on sama5d3_xplained?
> I'll try to find a SAMA5D31 evaluation kit meanwhile.
> 

Peter, would it worth to do on your board a similar test to what I did?
I'm thinking whether the source of interrupts matters or not. So can you
disable your USB and use a mtd NAND stress test as a source of interrupts?
mtd_stresstest together with scp or hexdump.

Cheers,
ta
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: <Tudor.Ambarus@microchip.com>
To: <peda@axentia.se>, <saravanak@google.com>
Cc: <alexandre.belloni@bootlin.com>, <gregkh@linuxfoundation.org>,
	<linux-kernel@vger.kernel.org>, <du@axentia.se>,
	<Ludovic.Desroches@microchip.com>,
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: Regression: memory corruption on Atmel SAMA5D31
Date: Mon, 7 Mar 2022 09:45:18 +0000	[thread overview]
Message-ID: <e47c155a-f25d-11b3-3339-b2bf71b886ce@microchip.com> (raw)
In-Reply-To: <3beb80ad-d6f8-8c6a-e17d-e40a644bba07@microchip.com>

On 3/4/22 18:48, Tudor.Ambarus@microchip.com wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On 3/4/22 14:38, Peter Rosin wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> Hi!
> 
> Hi, Peter!
> 
>>
>> On 2022-03-04 12:12, Tudor.Ambarus@microchip.com wrote:
>>> Hi, Peter!
>>>
>>> On 3/4/22 12:57, Peter Rosin wrote:
>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>
>>>> On 2022-03-04 07:57, Peter Rosin wrote:
>>>>> On 2022-03-04 04:55, Saravana Kannan wrote:
>>>>>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <peda@axentia.se> wrote:
>>>>>>>
>>>>>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>>>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <peda@axentia.se> wrote:
>>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>>>>>> things to try in order to track down what's going on. I have
>>>>>>>>> bisected the issue to
>>>>>>>>>
>>>>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>>>
>>>>>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>>>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>>>>>> Asking because of:
>>>>>>>> https://lore.kernel.org/lkml/20210930085714.2057460-1-yangyingliang@huawei.com/
>>>>>>>>
>>>>>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>>>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>>>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>>>>>
>>>>>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>>>>>> shows a difference driver issues would start becoming a real
>>>>>>>> possibility.
>>>>>>>>
>>>>>>>> -Saravana
>>>>>>>
>>>>>>> Thanks for the quick reply! I don't think I tested the very tip of
>>>>>>> Linus tree before, only latest rc or something like that, but now I
>>>>>>> have. I.e.
>>>>>>>
>>>>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>>>>
>>>>>>> It would have been typical if an issue that existed for a couple of
>>>>>>> years had been fixed the last few weeks, but alas, no.
>>>>>>>
>>>>>>> On that kernel, and with whatever the default fw_devlink value is, the
>>>>>>
>>>>>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>>>>>
>>>>>>> issue is there. It's a bit hard to tell if the incident probability
>>>>>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>>>>>> do not have to wait for long to get a bad hash with the first
>>>>>>> reproducer
>>>>>>>
>>>>>>>    while :; do cat testfile | sha256sum; done
>>>>>>>
>>>>>>> The output is typical:
>>>>>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215  -
>>>>>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570  -
>>>>>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3  -
>>>>>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645  -
>>>>>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe  -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>>>>>>>
>>>>>>> Setting fw_devlink=off makes no difference, AFAICT.
>>>>>>
>>>>>> By this, I'm assuming you set fw_devlink=off in the kernel command
>>>>>> line and you still saw the corruption.
>>>>>
>>>>> Yes. On a bad kernel it's the same with all of the following kernel
>>>>> command lines.
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>>> If that's the case, I can't see how this could possibly have anything
>>>>>> to do with:
>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>
>>>>>> If you look at fw_devlink_link_device(), you'll see that the function
>>>>>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>>>>>> there, the rest of the code in the series doesn't run because more
>>>>>> fields wouldn't get set, etc. That pretty much disables ALL the code
>>>>>> in the entire series. The only remaining diff would be header file
>>>>>> changes where I add/remove fields. But that's unlikely to cause any
>>>>>> issues here because I'm either deleting fields that aren't used or
>>>>>> adding fields that won't be used (with fw_devlink=off). I think the
>>>>>> patch was just causing enough timing changes that it's masking the
>>>>>> real issue.
>>>>>
>>>>> When I compare fw_devlink_link_device() from before and after
>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>> I notice that you also removed an unconditional call to
>>>>> device_link_add_missing_supplier_links() that was live before,
>>>>> regardless of any fw_devlink parameter.
>>>>>
>>>>> I don't know if that's relevant. Is it?
>>>>>
>>>>> Not knowing this code at all, and without any serious attempt
>>>>> at reading it, from here the comment of that removed function
>>>>> sure looks like it might cause a different ordering before and
>>>>> after the patch that is not restored with any fw_devlink
>>>>> argument.
>>>>
>>>> It appears that the device_link_add_missing_supplier_links() difference
>>>> is not relevant after all. What actually happened in the header file in
>>>> the "bad" commit was that two fields were removed (none added). Like so:
>>>>
>>>>  struct dev_links_info {
>>>>         struct list_head suppliers;
>>>>         struct list_head consumers;
>>>> -       struct list_head needs_suppliers;
>>>>         struct list_head defer_sync;
>>>> -       bool need_for_probe;
>>>>         enum dl_dev_state status;
>>>>  };
>>>>
>>>> If I restore those fields on a bad kernel, the issue is no longer
>>>> visible. That is true for the first bad kernel, i.e.
>>>>
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>
>>>> and for tip of Linus as of recently, i.e.
>>>>
>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>
>>>> Which is of course insane and a whole different level of bad. WTF!?!
>>>>
>>>> I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
>>>> there? I think that's next on the list...
>>>>
>>>
>>> I have a sama5d3_xplained that uses a SAMA5D36 and has a 256MBytes DDR2 and a
>>> 256MBytes NAND Flash. I tried a test with a 200MB file, rootfs on sdcard and
>>> I couldn't reproduce the bug. I'm using Linus's latest kernel:
>>> 38f80f42147f (HEAD, origin/master, origin/HEAD) MAINTAINERS: Remove dead patchwork link
>>>
>>> root@sama5d3-xplained-sd:~# dd if=/dev/urandom of=testfile bs=1024 count=200000
>>> 200000+0 records in
>>> 200000+0 records out
>>> 204800000 bytes (205 MB, 195 MiB) copied, 37.6424 s, 5.4 MB/s
>>> root@sama5d3-xplained-sd:~# for i in 1 2 3 4 5 6 7 8; do cat testfile | sha256sum; done
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de  -
>>> root@sama5d3-xplained-sd:~#
>>>
>>> I'll put the rootfs on NAND and try to retest. Maybe to do some other tests
>>> in parallel to have more interrupts on the system. Will let you know if I can
>>> reproduce the bug on sama5d3_xplained.
>>
>> Thanks for testing!
> 
> you're welcome, no worries.
>>
>> Since you (probably) don't have the interrupt source from the USB
>> serial chip that I have, that is not completely unexpected.
>>
>> $ lsusb
>> Bus 001 Device 002: ID 0403:6011 Future Technology Devices International, Ltd FT4232H Quad HS USB-UART/FIFO IC
>> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
>> Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
>> $ cat /sys/bus/usb-serial/devices/ttyUSB?/latency_timer
>> 1
>> 1
>> 1
>> 1
>>
>> Also, your file is perhaps too small? You leave approx 50MB for the
>> system, so it might be the case that the page cache can hold the whole
>> file?
>>
>> So, can you please try that again with a slightly bigger file or if you
>> restrict how much RAM you allow the kernel to see?
>>
>> And if you don't have the FTDI usb-serial chip, you should probably go
>> with the other reproducer, namely to simply copy the random file to a
>> different host using scp.
> 
> I kept the rootfs on sdcard but this time I generated a 300MB random file.
> I ran a mtd_stresstest on the NAND flash while doing the sha256sum or scp
> tests. All went fine.
> 
> Here's the mtd_stresstest being successful https://pastebin.com/eWQNHAsE
> While the stresstest was running I did the following sha256 and scp tests:
> https://pastebin.com/wjutw63C
> 
> On my laptop the sha256sum is matching the one on the board:
> $ sha256sum /tmp/testfile?
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile1
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile2
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile3
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile4
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile5
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile6
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile7
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58  /tmp/testfile8
> 
> Here's what "top" cmd was showing when doing the scp and the mtd_stresstest:
> top - 14:40:13 up 39 min,  3 users,  load average: 1.95, 1.88, 1.80
> Tasks:  54 total,   3 running,  51 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 35.1 us, 48.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi, 16.9 si,  0.0 st
> MiB Mem :    242.3 total,      2.5 free,     15.2 used,    224.6 buff/cache
> MiB Swap:      0.0 total,      0.0 free,      0.0 used.    220.1 avail Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   464 root      20   0    4296   3292   2940 R  46.6   1.3   0:17.53 ssh
>   401 root      20   0    1668    760    676 R  45.0   0.3  17:57.11 modprobe
>   463 root      20   0    3456   2232   2000 S   5.2   0.9   0:02.04 scp
> 
> Here's what "top" cmd was showing when doing the sha256sum and the mtd_stresstest:
> top - 14:12:47 up 12 min,  3 users,  load average: 2.14, 1.92, 1.08
> Tasks:  54 total,   3 running,  51 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 37.4 us, 58.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  4.2 si,  0.0 st
> MiB Mem :    242.3 total,      3.0 free,     14.8 used,    224.5 buff/cache
> MiB Swap:      0.0 total,      0.0 free,      0.0 used.    220.6 avail Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   420 root      20   0    1396    784    692 R  47.2   0.3   0:06.42 sha256sum
>   401 root      20   0    1668   1208   1124 R  43.0   0.5   4:50.34 modprobe
>   419 root      20   0    1520    868    680 S   6.5   0.3   0:00.92 cat
> 
> Peter, do you think it is worth to do some other tests on sama5d3_xplained?
> I'll try to find a SAMA5D31 evaluation kit meanwhile.
> 

Peter, would it worth to do on your board a similar test to what I did?
I'm thinking whether the source of interrupts matters or not. So can you
disable your USB and use a mtd NAND stress test as a source of interrupts?
mtd_stresstest together with scp or hexdump.

Cheers,
ta

  reply	other threads:[~2022-03-07  9:48 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-03  0:29 Regression: memory corruption on Atmel SAMA5D31 Peter Rosin
2022-03-03  3:02 ` Saravana Kannan
2022-03-03  3:02   ` Saravana Kannan
2022-03-03  9:17   ` Peter Rosin
2022-03-03  9:17     ` Peter Rosin
2022-03-04  3:55     ` Saravana Kannan
2022-03-04  3:55       ` Saravana Kannan
2022-03-04  6:57       ` Peter Rosin
2022-03-04  6:57         ` Peter Rosin
2022-03-04 10:57         ` Peter Rosin
2022-03-04 10:57           ` Peter Rosin
2022-03-04 11:12           ` Tudor.Ambarus
2022-03-04 11:12             ` Tudor.Ambarus
2022-03-04 12:38             ` Peter Rosin
2022-03-04 12:38               ` Peter Rosin
2022-03-04 16:48               ` Tudor.Ambarus
2022-03-04 16:48                 ` Tudor.Ambarus
2022-03-07  9:45                 ` Tudor.Ambarus [this message]
2022-03-07  9:45                   ` Tudor.Ambarus
2022-03-07 11:32                   ` Peter Rosin
2022-03-07 11:32                     ` Peter Rosin
2022-03-07 20:32                     ` Peter Rosin
2022-03-07 20:32                       ` Peter Rosin
2022-03-08  7:55                       ` Nicolas Ferre
2022-03-08  7:55                         ` Nicolas Ferre
2022-03-09  8:30                         ` Peter Rosin
2022-03-09  8:30                           ` Peter Rosin
     [not found]                           ` <6d9561a4-39e4-3dbe-5fe2-c6f88ee2a4c6@axentia.se>
     [not found]                             ` <ed24a281-1790-8e24-5f5a-25b66527044b@microchip.com>
     [not found]                               ` <d563c7ba-6431-2639-9f2a-2e2c6788e625@axentia.se>
     [not found]                                 ` <e5a715c5-ad9f-6fd4-071e-084ab950603e@microchip.com>
2022-03-10  9:58                                   ` Peter Rosin
2022-03-10  9:58                                     ` Peter Rosin
2022-03-10 10:40                                     ` Peter Rosin
2022-03-10 10:40                                       ` Peter Rosin
2022-04-09 13:02                                       ` Thorsten Leemhuis
2022-04-09 13:02                                         ` Thorsten Leemhuis
2022-04-11  6:21                                         ` Tudor.Ambarus
2022-04-11  6:21                                           ` Tudor.Ambarus
2022-05-17 14:50                                           ` Peter Rosin
2022-05-17 14:50                                             ` Peter Rosin
2022-05-18  6:21                                             ` Tudor.Ambarus
2022-05-18  6:21                                               ` Tudor.Ambarus
2022-05-18  7:51                                               ` Peter Rosin
2022-05-18  7:51                                                 ` Peter Rosin
2022-06-20  7:04                                                 ` Thorsten Leemhuis
2022-06-20  7:04                                                   ` Thorsten Leemhuis
2022-06-20  8:43                                                   ` Tudor.Ambarus
2022-06-20  8:43                                                     ` Tudor.Ambarus
2022-06-20 14:22                                                     ` Tudor.Ambarus
2022-06-20 14:22                                                       ` Tudor.Ambarus
2022-06-21  7:00                                                       ` Peter Rosin
2022-06-21  7:00                                                         ` Peter Rosin
2022-06-21 10:46                                                       ` Peter Rosin
2022-06-21 10:46                                                         ` Peter Rosin
2022-06-27 12:26                                                         ` Tudor.Ambarus
2022-06-27 12:26                                                           ` Tudor.Ambarus
2022-06-27 16:53                                                           ` Tudor.Ambarus
2022-06-27 16:53                                                             ` Tudor.Ambarus
2022-06-30  5:20                                                             ` Peter Rosin
2022-06-30  5:20                                                               ` Peter Rosin
2022-06-30  9:23                                                               ` Tudor.Ambarus
2022-06-30  9:23                                                                 ` Tudor.Ambarus
2022-06-30 10:20                                                                 ` Tudor.Ambarus
2022-06-30 10:20                                                                   ` Tudor.Ambarus
2022-07-13 16:01                                                             ` Tudor.Ambarus
2022-07-13 16:01                                                               ` Tudor.Ambarus
2022-07-28  7:45                                                               ` Tudor.Ambarus
2022-07-28  7:45                                                                 ` Tudor.Ambarus
2022-07-28  8:39                                                                 ` Tudor.Ambarus
2022-07-28  8:39                                                                   ` Tudor.Ambarus
2022-07-29 20:09                                                                   ` Peter Rosin
2022-07-29 20:09                                                                     ` Peter Rosin
2022-07-30 11:37                                                                     ` Peter Rosin
2022-07-30 11:37                                                                       ` Peter Rosin
2022-07-31  3:44                                                                       ` Tudor.Ambarus
2022-07-31  3:44                                                                         ` Tudor.Ambarus
2022-03-04 20:06           ` Saravana Kannan
2022-03-04 20:06             ` Saravana Kannan
2022-03-04  8:00 ` Thorsten Leemhuis
2022-03-04  8:00   ` Thorsten Leemhuis

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e47c155a-f25d-11b3-3339-b2bf71b886ce@microchip.com \
    --to=tudor.ambarus@microchip.com \
    --cc=Ludovic.Desroches@microchip.com \
    --cc=alexandre.belloni@bootlin.com \
    --cc=du@axentia.se \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peda@axentia.se \
    --cc=saravanak@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.