netdev.vger.kernel.org archive mirror
* Null pointer dereference in mcp251x driver when resuming from sleep
From: Frieder Schrempf @ 2021-05-03 13:11 UTC
  To: Marc Kleine-Budde, linux-can
  Cc: Wolfgang Grandegger, David S. Miller, Jakub Kicinski,
	Liam Girdwood, Mark Brown, Vincent Mailhol, Oliver Hartkopp,
	Timo Schlüßler, Andy Shevchenko, Tim Harvey, netdev,
	linux-kernel

Hi,

With kernels 5.10.x and 5.12.x I'm getting a NULL pointer dereference
from the mcp251x driver when I resume from sleep (see trace below).

As far as I can tell, this was working fine with 5.4. As I currently
don't have the time to do further debugging/bisecting, for now I want to
at least report this here.

Maybe someone here can already make an educated guess about what might
cause this just by looking at the trace/code?

Thanks
Frieder

[   32.626311] PM: suspend entry (deep)
[   32.630030] Filesystems sync: 0.000 seconds
[   32.635031] Freezing user space processes ... (elapsed 0.039 seconds) done.
[   32.681296] OOM killer disabled.
[   32.684542] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   32.814861] Disabling non-boot CPUs ...
[   32.819277] CPU1: shutdown
[   32.822002] psci: CPU1 killed (polled 0 ms)
[   32.827052] CPU2: shutdown
[   32.829772] psci: CPU2 killed (polled 0 ms)
[   32.834859] CPU3: shutdown
[   32.837582] psci: CPU3 killed (polled 0 ms)
[   32.842362] Enabling non-boot CPUs ...
[   32.846629] Detected VIPT I-cache on CPU1
[   32.846653] GICv3: CPU1: found redistributor 1 region 0:0x00000000388a0000
[   32.846707] CPU1: Booted secondary processor 0x0000000001 [0x410fd034]
[   32.847202] CPU1 is up
[   32.867394] Detected VIPT I-cache on CPU2
[   32.867411] GICv3: CPU2: found redistributor 2 region 0:0x00000000388c0000
[   32.867440] CPU2: Booted secondary processor 0x0000000002 [0x410fd034]
[   32.867777] CPU2 is up
[   32.887937] Detected VIPT I-cache on CPU3
[   32.887954] GICv3: CPU3: found redistributor 3 region 0:0x00000000388e0000
[   32.887984] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
[   32.888328] CPU3 is up
[   32.912371] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000100
[   32.921186] Mem abort info:
[   32.923980]   ESR = 0x96000004
[   32.927035]   EC = 0x25: DABT (current EL), IL = 32 bits
[   32.932349]   SET = 0, FnV = 0
[   32.935403]   EA = 0, S1PTW = 0
[   32.938545] Data abort info:
[   32.941425]   ISV = 0, ISS = 0x00000004
[   32.945261]   CM = 0, WnR = 0
[   32.948229] user pgtable: 4k pages, 48-bit VAs, pgdp=000000004310b000
[   32.954672] [0000000000000100] pgd=0000000000000000, p4d=0000000000000000
[   32.961469] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   32.967045] Modules linked in:
[   32.970104] CPU: 0 PID: 624 Comm: sh Not tainted 5.12.1-ktn+g807a88195d76 #1
[   32.977158] Hardware name: Kontron i.MX8MM N801X S LVDS (DT)
[   32.982820] pstate: 00000085 (nzcv daIf -PAN -UAO -TCO BTYPE=--)
[   32.988830] pc : __queue_work+0x28/0x3e0
[   32.992767] lr : queue_work_on+0x54/0x80
[   32.996694] sp : ffff8000128c3a30
[   33.000008] x29: ffff8000128c3a30 x28: ffff000002798a10
[   33.005327] x27: 0000000000000100 x26: 0000000000000000
[   33.010644] x25: ffff800010fc2408 x24: ffff800011371a7c
[   33.015961] x23: ffff8000114d5178 x22: ffff000002739880
[   33.021278] x21: 0000000000000100 x20: 0000000000000000
[   33.026596] x19: 0000000000000000 x18: 0000000000000010
[   33.031913] x17: 0000000000000000 x16: 0000000000000001
[   33.037230] x15: 0000000000000011 x14: ffff800010d2e4a0
[   33.042547] x13: 0000000000001002 x12: 0000000000000011
[   33.047864] x11: 0000000000000040 x10: ffff8000113e2068
[   33.053182] x9 : ffff8000113e2060 x8 : ffff000001c00270
[   33.058499] x7 : 0000000000000000 x6 : 0000000000000197
[   33.063816] x5 : 0000000000000000 x4 : 0000000000000001
[   33.069133] x3 : 0000000000000000 x2 : ffff000002798a10
[   33.074450] x1 : 0000000000000000 x0 : 0000000000000100
[   33.079767] Call trace:
[   33.082214]  __queue_work+0x28/0x3e0
[   33.085794]  queue_work_on+0x54/0x80
[   33.089373]  mcp251x_can_resume+0x94/0xb8
[   33.093388]  dpm_run_callback.isra.0+0x20/0x78
[   33.097839]  device_resume+0x78/0x160
[   33.101505]  dpm_resume+0xc0/0x1e8
[   33.104909]  dpm_resume_end+0x18/0x30
[   33.108573]  suspend_devices_and_enter+0x23c/0x4d8
[   33.113369]  pm_suspend+0x1e4/0x268
[   33.116861]  state_store+0x8c/0x118
[   33.120352]  kobj_attr_store+0x18/0x30
[   33.124108]  sysfs_kf_write+0x44/0x58
[   33.127776]  kernfs_fop_write_iter+0x118/0x1a8
[   33.132223]  new_sync_write+0xe8/0x188
[   33.135978]  vfs_write+0x254/0x388
[   33.139384]  ksys_write+0x6c/0xf8
[   33.142702]  __arm64_sys_write+0x1c/0x28
[   33.146629]  el0_svc_common.constprop.0+0x60/0x120
[   33.151427]  do_el0_svc+0x24/0x90
[   33.154745]  el0_svc+0x24/0x38
[   33.157807]  el0_sync_handler+0xb0/0xb8
[   33.161645]  el0_sync+0x174/0x180
[   33.164968] Code: 2a0003f5 a90573fb 2a0003fb aa0203fc (b9410020)
[   33.171066] ---[ end trace b4f771b250a07a74 ]---
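
The word in parentheses on the "Code:" line is the faulting instruction.
As a rough sketch (not part of the original report), the helper below
decodes it, assuming it is an A64 "LDR Wt, [Xn, #imm]" with an unsigned,
scaled offset. The struct and function names are made up for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: fields of an A64 "LDR Wt, [Xn, #imm]". */
struct a64_ldr_w {
	bool matches;     /* true if the word has that encoding */
	unsigned rt;      /* destination register number (Wt) */
	unsigned rn;      /* base register number (Xn) */
	unsigned offset;  /* byte offset added to the base register */
};

/*
 * Decode a 32-bit LDR (immediate, unsigned offset).
 * Bits [31:22] == 0b1011100101 select this form; imm12 sits in
 * bits [21:10] and is scaled by the access size (4 bytes).
 */
static struct a64_ldr_w decode_ldr_w_uimm(uint32_t insn)
{
	struct a64_ldr_w d;

	d.matches = (insn >> 22) == 0x2E5;
	d.rt = insn & 0x1f;
	d.rn = (insn >> 5) & 0x1f;
	d.offset = ((insn >> 10) & 0xfff) * 4;
	return d;
}
```

For 0xb9410020 this yields a load into w0 from [x1, #0x100]. With x1 = 0
in the register dump, that matches the faulting address 0x100 and
suggests that the second argument of __queue_work(), the workqueue
pointer, was NULL.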


* Re: Null pointer dereference in mcp251x driver when resuming from sleep
From: Andy Shevchenko @ 2021-05-03 13:44 UTC
  To: Frieder Schrempf
  Cc: Marc Kleine-Budde, linux-can, Wolfgang Grandegger,
	David S. Miller, Jakub Kicinski, Liam Girdwood, Mark Brown,
	Vincent Mailhol, Oliver Hartkopp, Timo Schlüßler,
	Tim Harvey, netdev, linux-kernel

On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> Hi,
> 
> with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> exception from the mcp251x driver when I resume from sleep (see trace
> below).
> 
> As far as I can tell this was working fine with 5.4. As I currently don't
> have the time to do further debugging/bisecting, for now I want to at least
> report this here.
> 
> Maybe there is someone around who could already give a wild guess for what
> might cause this just by looking at the trace/code!?

Does reverting c7299fea6769 ("spi: Fix spi device unregister flow") help?

-- 
With Best Regards,
Andy Shevchenko




* Re: Null pointer dereference in mcp251x driver when resuming from sleep
From: Andy Shevchenko @ 2021-05-03 13:48 UTC
  To: Frieder Schrempf
  Cc: Marc Kleine-Budde, linux-can, Wolfgang Grandegger,
	David S. Miller, Jakub Kicinski, Liam Girdwood, Mark Brown,
	Vincent Mailhol, Oliver Hartkopp, Timo Schlüßler,
	Tim Harvey, netdev, linux-kernel

On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> > Hi,
> > 
> > with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> > exception from the mcp251x driver when I resume from sleep (see trace
> > below).
> > 
> > As far as I can tell this was working fine with 5.4. As I currently don't
> > have the time to do further debugging/bisecting, for now I want to at least
> > report this here.
> > 
> > Maybe there is someone around who could already give a wild guess for what
> > might cause this just by looking at the trace/code!?
> 
> Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?

Other than that, bisecting should take no more than 3-4 iterations:
% git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
e0e25001d088 can: mcp251x: add support for half duplex controllers
74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
2d52dabbef60 can: mcp251x: add GPIO support
cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
df561f6688fe treewide: Use fallthrough pseudo-keyword
8ce8c0abcba3 can: mcp251x: only reset hardware as required
877a902103fd can: mcp251x: add mcp251x_write_2regs() and make use of it
50ec88120ea1 can: mcp251x: get rid of legacy platform data
14684b93019a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

-- 
With Best Regards,
Andy Shevchenko




* Re: Null pointer dereference in mcp251x driver when resuming from sleep
From: Frieder Schrempf @ 2021-05-03 13:49 UTC
  To: Andy Shevchenko
  Cc: Marc Kleine-Budde, linux-can, Wolfgang Grandegger,
	David S. Miller, Jakub Kicinski, Liam Girdwood, Mark Brown,
	Vincent Mailhol, Oliver Hartkopp, Timo Schlüßler,
	Tim Harvey, netdev, linux-kernel

On 03.05.21 15:44, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
>> Hi,
>>
>> with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
>> exception from the mcp251x driver when I resume from sleep (see trace
>> below).
>>
>> As far as I can tell this was working fine with 5.4. As I currently don't
>> have the time to do further debugging/bisecting, for now I want to at least
>> report this here.
>>
>> Maybe there is someone around who could already give a wild guess for what
>> might cause this just by looking at the trace/code!?
> 
> Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?

This commit is so new that it is in neither 5.10.x nor 5.12.1, so it
can't be the cause.


* Re: Null pointer dereference in mcp251x driver when resuming from sleep
From: Andy Shevchenko @ 2021-05-03 13:54 UTC
  To: Frieder Schrempf
  Cc: Marc Kleine-Budde, linux-can, Wolfgang Grandegger,
	David S. Miller, Jakub Kicinski, Liam Girdwood, Mark Brown,
	Vincent Mailhol, Oliver Hartkopp, Timo Schlüßler,
	Tim Harvey, netdev, linux-kernel

On Mon, May 03, 2021 at 04:48:10PM +0300, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
> > On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> > > Hi,
> > > 
> > > with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> > > exception from the mcp251x driver when I resume from sleep (see trace
> > > below).
> > > 
> > > As far as I can tell this was working fine with 5.4. As I currently don't
> > > have the time to do further debugging/bisecting, for now I want to at least
> > > report this here.
> > > 
> > > Maybe there is someone around who could already give a wild guess for what
> > > might cause this just by looking at the trace/code!?
> > 
> > Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?
> 
> Other than that, bisecting will take not more than 3-4 iterations only:
> % git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
> 3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
> e0e25001d088 can: mcp251x: add support for half duplex controllers
> 74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
> 2d52dabbef60 can: mcp251x: add GPIO support
> cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
> df561f6688fe treewide: Use fallthrough pseudo-keyword

> 8ce8c0abcba3 can: mcp251x: only reset hardware as required

And the only smoking gun I found by analyzing the code is the above. So, as a
first step, I would simply test just before that commit and immediately after
it (15-30 minutes of work). (I would do it myself if I had the hardware at
hand...)

> 877a902103fd can: mcp251x: add mcp251x_write_2regs() and make use of it
> 50ec88120ea1 can: mcp251x: get rid of legacy platform data
> 14684b93019a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

-- 
With Best Regards,
Andy Shevchenko




* Re: Null pointer dereference in mcp251x driver when resuming from sleep
From: Frieder Schrempf @ 2021-05-04 13:54 UTC
  To: Andy Shevchenko, Timo Schlüßler, Marc Kleine-Budde
  Cc: linux-can, Wolfgang Grandegger, David S. Miller, Jakub Kicinski,
	Liam Girdwood, Mark Brown, Vincent Mailhol, Oliver Hartkopp,
	Tim Harvey, netdev, linux-kernel

On 03.05.21 15:54, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 04:48:10PM +0300, Andy Shevchenko wrote:
>> On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
>>> On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
>>>> Hi,
>>>>
>>>> with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
>>>> exception from the mcp251x driver when I resume from sleep (see trace
>>>> below).
>>>>
>>>> As far as I can tell this was working fine with 5.4. As I currently don't
>>>> have the time to do further debugging/bisecting, for now I want to at least
>>>> report this here.
>>>>
>>>> Maybe there is someone around who could already give a wild guess for what
>>>> might cause this just by looking at the trace/code!?
>>>
>>> Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?
>>
>> Other than that, bisecting will take not more than 3-4 iterations only:
>> % git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
>> 3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
>> e0e25001d088 can: mcp251x: add support for half duplex controllers
>> 74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
>> 2d52dabbef60 can: mcp251x: add GPIO support
>> cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
>> df561f6688fe treewide: Use fallthrough pseudo-keyword
> 
>> 8ce8c0abcba3 can: mcp251x: only reset hardware as required
> 
> And only smoking gun by analyzing the code is the above. So, for the first I
> would simply check before that commit and immediately after (15-30 minutes of
> work). (I would do it myself if I had a hardware at hand...)

Thanks for pointing that out. Indeed when I revert this commit it works 
fine again.

When I look at the change, I see that queue_work(priv->wq,
&priv->restart_work) is now called in two cases: when the interface is
brought up after resume, and also when the device is only powered up
after resume but the interface stays down.

The latter is a problem if the device was never brought up before, as 
the workqueue is only allocated and initialized in mcp251x_open().

To me it looks like a proper fix would be to just move the workqueue 
init to the probe function to make sure it is available when resuming 
even if the interface was never up before.

I will try this and send a patch if it looks good.
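
The ordering problem described above can be illustrated with a small
userspace model. This is only a sketch: it is not the driver code or the
kernel workqueue API, and every model_* name is hypothetical. It shows
why resuming before the first open sees a NULL workqueue, and why
allocating the workqueue at probe time avoids that.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a workqueue; counts queued work items. */
struct model_wq {
	int queued;
};

/* Hypothetical stand-in for the driver's private data. */
struct model_priv {
	struct model_wq *wq;	/* plays the role of priv->wq */
};

/*
 * In the kernel, queueing onto a NULL workqueue dereferences it and
 * oopses. The model reports the failure instead of crashing.
 */
static bool model_queue_work(struct model_wq *wq)
{
	if (!wq)
		return false;
	wq->queued++;
	return true;
}

/* Probe: allocate the workqueue only if the proposed fix is modeled. */
static int model_probe(struct model_priv *priv, bool alloc_wq_in_probe)
{
	priv->wq = NULL;
	if (alloc_wq_in_probe) {
		priv->wq = calloc(1, sizeof(*priv->wq));
		if (!priv->wq)
			return -1;
	}
	return 0;
}

/* Open: without the fix, this is the first place the workqueue exists. */
static int model_open(struct model_priv *priv)
{
	if (!priv->wq) {
		priv->wq = calloc(1, sizeof(*priv->wq));
		if (!priv->wq)
			return -1;
	}
	return 0;
}

/* Resume: queues restart work even if the interface was never opened. */
static bool model_resume(struct model_priv *priv)
{
	return model_queue_work(priv->wq);
}

/* Remove: release the workqueue. */
static void model_remove(struct model_priv *priv)
{
	free(priv->wq);
	priv->wq = NULL;
}
```

Calling model_probe(&p, false) followed directly by model_resume(&p)
returns false, which corresponds to the oops in the report. With
model_probe(&p, true) the same resume succeeds without any prior open.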


* Re: Null pointer dereference in mcp251x driver when resuming from sleep
From: Andy Shevchenko @ 2021-05-04 14:19 UTC
  To: Frieder Schrempf
  Cc: Timo Schlüßler, Marc Kleine-Budde, linux-can,
	Wolfgang Grandegger, David S. Miller, Jakub Kicinski,
	Liam Girdwood, Mark Brown, Vincent Mailhol, Oliver Hartkopp,
	Tim Harvey, netdev, linux-kernel

On Tue, May 04, 2021 at 03:54:00PM +0200, Frieder Schrempf wrote:
> On 03.05.21 15:54, Andy Shevchenko wrote:
> > On Mon, May 03, 2021 at 04:48:10PM +0300, Andy Shevchenko wrote:
> > > On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
> > > > On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> > > > > Hi,
> > > > > 
> > > > > with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> > > > > exception from the mcp251x driver when I resume from sleep (see trace
> > > > > below).
> > > > > 
> > > > > As far as I can tell this was working fine with 5.4. As I currently don't
> > > > > have the time to do further debugging/bisecting, for now I want to at least
> > > > > report this here.
> > > > > 
> > > > > Maybe there is someone around who could already give a wild guess for what
> > > > > might cause this just by looking at the trace/code!?
> > > > 
> > > > Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?
> > > 
> > > Other than that, bisecting will take not more than 3-4 iterations only:
> > > % git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
> > > 3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
> > > e0e25001d088 can: mcp251x: add support for half duplex controllers
> > > 74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
> > > 2d52dabbef60 can: mcp251x: add GPIO support
> > > cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
> > > df561f6688fe treewide: Use fallthrough pseudo-keyword
> > 
> > > 8ce8c0abcba3 can: mcp251x: only reset hardware as required
> > 
> > And only smoking gun by analyzing the code is the above. So, for the first I
> > would simply check before that commit and immediately after (15-30 minutes of
> > work). (I would do it myself if I had a hardware at hand...)
> 
> Thanks for pointing that out. Indeed when I revert this commit it works fine
> again.
> 
> When I look at the change I see that queue_work(priv->wq,
> &priv->restart_work) is called in two cases, when the interface is brought
> up after resume and now also when the device is only powered up after resume
> but the interface stays down.
> 
> The latter is a problem if the device was never brought up before, as the
> workqueue is only allocated and initialized in mcp251x_open().
> 
> To me it looks like a proper fix would be to just move the workqueue init to
> the probe function to make sure it is available when resuming even if the
> interface was never up before.
> 
> I will try this and send a patch if it looks good.

Sounds like a plan!

-- 
With Best Regards,
Andy Shevchenko


