From: Saravana Kannan <saravanak@google.com>
To: Peter Rosin <peda@axentia.se>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
    "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>,
    Nicolas Ferre <Nicolas.Ferre@microchip.com>,
    Alexandre Belloni <alexandre.belloni@bootlin.com>,
    Ludovic Desroches <Ludovic.Desroches@microchip.com>,
    Daniels Umanovskis <du@axentia.se>,
    Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: Regression: memory corruption on Atmel SAMA5D31
Date: Thu, 3 Mar 2022 19:55:42 -0800
Message-ID: <CAGETcx8Bppn1y3Hffp2N_DPcJA6YyMEv1EFDTa1e1zOrkxbxzw@mail.gmail.com>
In-Reply-To: <69bb004f-0bb4-ec56-479c-5deab0ece00f@axentia.se>

On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <peda@axentia.se> wrote:
>
> On 2022-03-03 04:02, Saravana Kannan wrote:
> > On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <peda@axentia.se> wrote:
> >>
> >> Hi!
> >>
> >> I'm seeing a weird problem, and I'd like some help with further
> >> things to try in order to track down what's going on. I have
> >> bisected the issue to
> >>
> >> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> >
> > I skimmed through your email and I'll read it more closely tomorrow,
> > but it wasn't clear if you see this on Linus's tip of the tree too.
> > Asking because of:
> > https://lore.kernel.org/lkml/20210930085714.2057460-1-yangyingliang@huawei.com/
> >
> > Also, a couple of other data points that _might_ help. Try kernel
> > command line option fw_devlink=permissive vs fw_devlink=on (I forget
> > if this was the default by 5.10) vs fw_devlink=off.
> >
> > I'm expecting "off" to fix the issue for you. But if permissive vs on
> > shows a difference driver issues would start becoming a real
> > possibility.
> >
> > -Saravana
>
> Thanks for the quick reply! I don't think I tested the very tip of
> Linus tree before, only latest rc or something like that, but now I
> have. I.e.
>
> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>
> It would have been typical if an issue that existed for a couple of
> years had been fixed the last few weeks, but alas, no.
>
> On that kernel, and with whatever the default fw_devlink value is, the

It's fw_devlink=on by default from at least 5.12-rc4 or so.

> issue is there. It's a bit hard to tell if the incident probability
> is the same when trying fw_devlink arguments, but roughly so, and I
> do not have to wait for long to get a bad hash with the first
> reproducer
>
>   while :; do cat testfile | sha256sum; done
>
> The output is typical:
> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215  -
> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570  -
> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3  -
> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645  -
> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>
> Setting fw_devlink=off makes no difference, AFAICT.

By this, I'm assuming you set fw_devlink=off in the kernel command line
and you still saw the corruption. If that's the case, I can't see how
this could possibly have anything to do with:

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

If you look at fw_devlink_link_device(), you'll see that the function
is NOP if fw_devlink=off (the !fw_devlink_flags check). And from there,
the rest of the code in the series doesn't run because more fields
wouldn't get set, etc. That pretty much disables ALL the code in the
entire series. The only remaining diff would be header file changes
where I add/remove fields. But that's unlikely to cause any issues here
because I'm either deleting fields that aren't used or adding fields
that won't be used (with fw_devlink=off). I think the patch was just
causing enough timing changes that it's masking the real issue.

IIRC (it's been more than a year), the series [1] that brings in this
patch has a few reverts. Those reverts undo subtle device probe
ordering changes brought in by a bunch of earlier patches. You could go
back to before those patches were added and see if you still see this
corruption and then start bisecting from there. Basically try going to
a point before:

42926ac3cd50 ("driver core: Move code to the right part of the file")

TL;DR: is that since you are reproducing this with fw_devlink=off, I'm
pretty sure the problem is not actually because of my changes or any
changes related to fw_devlink.

-Saravana

[1] - https://lore.kernel.org/all/20201121020232.908850-1-saravanak@google.com/

>
> So, just to double-check I went back to 5.11.22 with the two
> mentioned patches reverted [1], plus an added backport of
>
> c73960bb0a43 ("gpiolib: allow line names from device props to override driver names")
>
> in order to make userspace behave as similarly as possible.
> I left that running for an hour or so with 350-ish hashes
> calculated correctly. Which is no proof that there is no latent
> issue of course, but at the very least a great deal more stable
> than later kernels.
>
> Cheers,
> Peter
>
> [1]
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> 2d09e6eb4a6f ("driver core: Delete pointless parameter in fwnode_operations.add_links")
>
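
Comparing "incident probability" between kernels or fw_devlink settings
is easier with a count than with a scrolling list of hashes. The sketch
below wraps the same "cat testfile | sha256sum" pipeline from the
message in a bounded loop and tallies the results. The function name
count_bad_hashes and its two arguments are hypothetical and not part of
the thread, and it assumes the most frequent hash is the correct one,
which fits the quoted output where a single hash recurs most often.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): hash FILE a fixed number of
# times and summarize how many runs disagree with the most common hash.
# Usage: count_bad_hashes FILE RUNS
count_bad_hashes() {
    file=$1
    runs=$2
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Same pipeline as the reproducer in the message.
        cat "$file" | sha256sum
        i=$((i + 1))
    done | sort | uniq -c | sort -rn | awk '
        # Assumption: the most frequent hash is the correct one.
        NR == 1 { good = $1; ref = $2 }
        NR > 1  { bad += $1 }
        END     { printf "reference=%s good=%d bad=%d\n", ref, good, bad + 0 }'
}
```

For example, "count_bad_hashes testfile 350" roughly matches the number
of hashes mentioned for the 5.11.22 run, so the same invocation on two
kernels gives directly comparable good/bad counts.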