All of lore.kernel.org
 help / color / mirror / Atom feed
* [REGRESSION/bisected] kernel panic after "rmmod cx23885" by "si2157: implement signal strength stats"
@ 2016-01-08 12:18 Martin Witte
  2016-01-09 21:42 ` Ernst Martin Witte
  0 siblings, 1 reply; 5+ messages in thread
From: Martin Witte @ 2016-01-08 12:18 UTC (permalink / raw)
  To: Linux Media Mailing List

Hi!

First apologies if  this bug report is not perfect  ...  It's my first
time hunting a kernel bug and doing the bisect.

My way to reliably reproduce the issue:

  * relevance / use-case:

    HTPC going into sleep (currently  requires rmmod as isolated here)
    when not being used.

  * base system:
    - Ubuntu 15.10 (arch: amd64)
    - PCIe DVB-C card DVBSky T982 (2 DVB-C/T tuners)
    - starting point: Ubuntu kernel 4.2.x / x86_64
    
  * run tvheadend  4.x using the  DVB-C tuners (from git/master  in my
    case, but actual release should  not be that relevant) ...  likely
    any other DVB-C application would be sufficient.

  * trigger kernel panic:
  
    - stop tvheadend
    
    - rmmod cx23885
    
      Please note: It seems to be  relevant that the DVB-C tuners were
      in use (and released) before the call to rmmod.  During testing,
      I  had situations  where I  could successfully  do the  rmmod if
      tvheadend was not yet started after a fresh boot.

      The kernel  panic happens  after the rmmod  with a  little delay
      (less than a second) - in some completely unrelated parts of the
      kernel, but reproducible locations, e.g. a stack frame like:

          call_timer_fn+0x35/0xf0
          __run_timers (inlined)
          run_timer_softirq+0x221/0x2d0
          __do_softirq+0x0f6/0250

          (unable to handle paging request, aka invalid pointer)

      or

          call_timer_fn+0x35/0xf0
          __run_timers (inlined)
          run_timer_softirq+0x162/0x2d0
          __do_softirq+0x0f6/0250

          (null pointer dereference)

      The faulting line is a call to NULL or invalid function pointer,
      namely line 1178 in media_tree/kernel/time/timer.c

      1177         trace_timer_expire_entry(timer);
      1178         fn(data);
      1179         trace_timer_expire_exit(timer);

      fn is a  null pointer if called within the  "if (irqsafe)" block
      or an invalid pointer if  called from the respective else branch
      (ll. 1284 in media_tree/kernel/time/timer.c)

  * Affected:
  
       - v4.2.x
       
       - v4.3.x
       
       - v4.4.rcX
       
       - media-tree/master (after  workaround for  regression identified
                            by   Matthias   Schwarzott    on   Jan.   6,
                            "[REGRESSION/bisected]   kernel    oops   in
                            vb2_core_qbuf")
  
  * Not affected: v4.1.x

  * Bisect   result    on   git://github.com/torvalds/linux.git/master
    between v4.1 and v4.2
    
    2f1ea29fca781b8e6600f3ece1f2dd98ae276294 is the first bad commit
    commit 2f1ea29fca781b8e6600f3ece1f2dd98ae276294
    Author: Antti Palosaari <crope@iki.fi>
    Date:   Sun Sep 7 11:20:34 2014 -0300

        [media] si2157: implement signal strength stats
    
        Implement DVBv5 signal strength stats. Returns dBm.
    
        Signed-off-by: Antti Palosaari <crope@iki.fi>
        Tested-by: Adam Baker <linux@baker-net.org.uk>
        Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

    :040000 040000 1fc70a3d18532f91289e8b581081ee59feefc321 5130e9b011e9c4ba683cd4db3eae8dca67c3ef0e M      drivers

  * Workaround:

    Reverting this patch on my tested kernels (incl. 4.4.rc8) prevents
    the kernel panic.

  * Guesses (Conclusion would be far too much ;-)

    - Memory  corruption by  some of  the media  drivers? Without  any
      personal background in linux  kernel development, it's currently
      my only idea how those  far-apart code locations can affect each
      other. If it is memory corruption - is it
      
      * caused by the identified patch? ... or
      
      * caused  by some  other  part  of the  media  drivers but  made
        "observable" by the identified patch?

    - Are there kernel  timers created by the media  drivers which are
      not cleaned up properly upon rmmmod?

    ...just few ideas (...  without any background on kernel dev ;-)


BR and many thx for your kernel development work!
  Martin



P.S.: Bisect log:

user@host:/a/b/c$ git bisect log
# bad: [64291f7db5bd8150a74ad2036f1037e6a0428df2] Linux 4.2
# good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
git bisect start 'v4.2' 'v4.1'
# bad: [c11d716218910c3aa2bac1bb641e6086ad649555] Merge tag 'armsoc-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad c11d716218910c3aa2bac1bb641e6086ad649555
# good: [8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2] mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
git bisect good 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2
# good: [14738e03312ff1137109d68bcbf103c738af0f4a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect good 14738e03312ff1137109d68bcbf103c738af0f4a
# good: [4570a37169d4b44d316f40b2ccc681dc93fedc7b] Merge tag 'sound-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 4570a37169d4b44d316f40b2ccc681dc93fedc7b
# bad: [78aad7f81aa6dfccdb2804ac35db6fc371d265cf] [media] vivid-tpg: precalculate colorspace/xfer_func combinations
git bisect bad 78aad7f81aa6dfccdb2804ac35db6fc371d265cf
# good: [356484cabe44984d2dc66a90bd5e3465ba1f64fb] [media] dw2102: resync fifo when demod locks
git bisect good 356484cabe44984d2dc66a90bd5e3465ba1f64fb
# good: [e4aa18d33c3a05f9ac51a8c8c7863318c807650f] [media] DocBook: Improve the description of the properties API
git bisect good e4aa18d33c3a05f9ac51a8c8c7863318c807650f
# good: [e01dfc01914ab9a078ca8d08287c19c6663b5438] [media] videodev2.h: add COLORSPACE_DEFAULT
git bisect good e01dfc01914ab9a078ca8d08287c19c6663b5438
# good: [dc9ef7d11207a04514ca195f0c9f4d2ac56696e1] [media] DocBook media: rewrite frontend open/close
git bisect good dc9ef7d11207a04514ca195f0c9f4d2ac56696e1
# bad: [5ac417efe66ddd7cd70a98f7f4e32a14ae40a651] [media] sh_vou: avoid going past arrays
git bisect bad 5ac417efe66ddd7cd70a98f7f4e32a14ae40a651
# bad: [171fe6d1270d535eae798e4b5acc9f5d25e6e17e] [media] media: davinci_vpfe: set minimum required buffers to three
git bisect bad 171fe6d1270d535eae798e4b5acc9f5d25e6e17e
# good: [d2b72f6482b9a3c57f036c11786a2489dcc81176] [media] si2168: Implement own I2C adapter locking
git bisect good d2b72f6482b9a3c57f036c11786a2489dcc81176
# bad: [694f9963edd831e4ed6fdbcb7134525cf5715a79] [media] media: davinci_vpfe: clear the output_specs
git bisect bad 694f9963edd831e4ed6fdbcb7134525cf5715a79
# bad: [2f1ea29fca781b8e6600f3ece1f2dd98ae276294] [media] si2157: implement signal strength stats
git bisect bad 2f1ea29fca781b8e6600f3ece1f2dd98ae276294
# first bad commit: [2f1ea29fca781b8e6600f3ece1f2dd98ae276294] [media] si2157: implement signal strength stats

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION/bisected] kernel panic after "rmmod cx23885" by "si2157: implement signal strength stats"
  2016-01-08 12:18 [REGRESSION/bisected] kernel panic after "rmmod cx23885" by "si2157: implement signal strength stats" Martin Witte
@ 2016-01-09 21:42 ` Ernst Martin Witte
  2016-01-10  3:43   ` Antti Palosaari
  0 siblings, 1 reply; 5+ messages in thread
From: Ernst Martin Witte @ 2016-01-09 21:42 UTC (permalink / raw)
  To: Martin Witte; +Cc: Linux Media Mailing List

Hi again,

seems  that the  cause  for the  kernel  panic is  a  missing call  to
cancel_delayed_work_sync in  si2157_remove before  the call  to kfree.
After adding cancel_delayed_work_sync(&dev->stat_work), rmmod does not
trigger the kernel panic any more.

However, very similar issues could be identified also in other modules:

   ts2020
   af9013
   af9033
   rtl2830

when looking in drivers/media/tuners and drivers/media/dvb-frontends.

Therefore,  the submitted  patch  set contains  fixes  also for  those
modules. The submitted patch set is:

   [PATCH 0/5] [media] cancel_delayed_work_sync before device removal / kfree

I hope these patches completely fix the issue and are ok for inclusion
in the kernel.

BR and thx,
   Martin

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION/bisected] kernel panic after "rmmod cx23885" by "si2157: implement signal strength stats"
  2016-01-09 21:42 ` Ernst Martin Witte
@ 2016-01-10  3:43   ` Antti Palosaari
  2016-01-10 22:04     ` Ernst Martin Witte
  0 siblings, 1 reply; 5+ messages in thread
From: Antti Palosaari @ 2016-01-10  3:43 UTC (permalink / raw)
  To: emw, Martin Witte; +Cc: Linux Media Mailing List

Hello
Those timers are activated on frontend init() ja de-activated on 
sleep(). Removing active driver does not sound good and IMHO it should 
not be even possible. I think it should be find out why it is possible 
to remove whole driver before it is put to sleep().

Antti


On 01/09/2016 11:42 PM, Ernst Martin Witte wrote:
> Hi again,
>
> seems  that the  cause  for the  kernel  panic is  a  missing call  to
> cancel_delayed_work_sync in  si2157_remove before  the call  to kfree.
> After adding cancel_delayed_work_sync(&dev->stat_work), rmmod does not
> trigger the kernel panic any more.
>
> However, very similar issues could be identified also in other modules:
>
>     ts2020
>     af9013
>     af9033
>     rtl2830
>
> when looking in drivers/media/tuners and drivers/media/dvb-frontends.
>
> Therefore,  the submitted  patch  set contains  fixes  also for  those
> modules. The submitted patch set is:
>
>     [PATCH 0/5] [media] cancel_delayed_work_sync before device removal / kfree
>
> I hope these patches completely fix the issue and are ok for inclusion
> in the kernel.
>
> BR and thx,
>     Martin
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
http://palosaari.fi/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION/bisected] kernel panic after "rmmod cx23885" by "si2157: implement signal strength stats"
  2016-01-10  3:43   ` Antti Palosaari
@ 2016-01-10 22:04     ` Ernst Martin Witte
  2016-01-11  2:34       ` Antti Palosaari
  0 siblings, 1 reply; 5+ messages in thread
From: Ernst Martin Witte @ 2016-01-10 22:04 UTC (permalink / raw)
  To: Linux Media Mailing List, Antti Palosaari

Hi Antti!

Thx for  your quick answer. Since  I'm not familiar with  linux kernel
driver  internals,  may  I  ask  few  questions  in  order  to  better
understand the  intended use  of ..._delayed_work and  the interaction
between drivers like the cx23885 and the tuner modules?

- Is si2157_sleep not related to "suspended" system power state but to
  an "unused" state of the tuner physical device?

- Which  mechanism  should  call  si2157_sleep when  e.g.  cx23885  is
  rmmod'ed?

- What  I  understood   so  far  is  that  we  should   not  need  the
  cance_delayed_work_sync in ..._remove  because ..._sleep should have
  been called before.  Why is  such a cancel_delayed_work call already
  in  the  kernel  in  rtl2832_remove although  this  module  has  the
  separate cancel_delayed_work call in rtl23832_sleep?

- If si2157_sleep  was not called and  the device is considered  to be
  still   active,  which   mechanism  should   prevent  the   call  of
  si2157_remove or the unloading of module cx23885?

- Is it correct that rmmod cx23885  is legitime if none of its devices
  is in use by any app and if those "links" between si2157 and cx23885
  are  shut down  correctly?  ...   and  should only  be prevented  if
  something goes wrong with shutting down those "links"?

- Is the  interpretation correct  that si2157_remove should  be called
  when the  driver/module si2157  is unloaded?  If yes  - how  does it
  happen that it gets called when only an rmmod cx23885 is done?

- What    is   the    intended    call   sequence    of   all    those
  sleep/remove/... callbacks when

  a) an applications closes the device (but driver remains loaded, but
     presumingly inactive)

  b) the  one or other  module is  rmmod'ed (for me  personally mainly
     cx23885)

...  you might  see that  for me  it's mainly  unclear how  si2157 and
cx23885 are linked in terms of callbacks.

Is there anything  I could try (adding debug statements,  etc.) / test
in order to get more insight what's going wrong?

BR and many thanks for your help,
  Martin


On 10.01.2016 05:43, Antti Palosaari wrote:
> Hello
> Those timers are activated on frontend init() ja de-activated on sleep().
> Removing active driver does not sound good and IMHO it should not be even
> possible. I think it should be find out why it is possible to remove whole
> driver before it is put to sleep().
> 
> Antti
> 
> 
> On 01/09/2016 11:42 PM, Ernst Martin Witte wrote:
> >Hi again,
> >
> >seems  that the  cause  for the  kernel  panic is  a  missing call  to
> >cancel_delayed_work_sync in  si2157_remove before  the call  to kfree.
> >After adding cancel_delayed_work_sync(&dev->stat_work), rmmod does not
> >trigger the kernel panic any more.
> >
> >However, very similar issues could be identified also in other modules:
> >
> >    ts2020
> >    af9013
> >    af9033
> >    rtl2830
> >
> >when looking in drivers/media/tuners and drivers/media/dvb-frontends.
> >
> >Therefore,  the submitted  patch  set contains  fixes  also for  those
> >modules. The submitted patch set is:
> >
> >    [PATCH 0/5] [media] cancel_delayed_work_sync before device removal / kfree
> >
> >I hope these patches completely fix the issue and are ok for inclusion
> >in the kernel.
> >
> >BR and thx,
> >    Martin
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-media" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> -- 
> http://palosaari.fi/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION/bisected] kernel panic after "rmmod cx23885" by "si2157: implement signal strength stats"
  2016-01-10 22:04     ` Ernst Martin Witte
@ 2016-01-11  2:34       ` Antti Palosaari
  0 siblings, 0 replies; 5+ messages in thread
From: Antti Palosaari @ 2016-01-11  2:34 UTC (permalink / raw)
  To: emw-linux-media-2016, Linux Media Mailing List

On 01/11/2016 12:04 AM, Ernst Martin Witte wrote:
> Hi Antti!
>
> Thx for  your quick answer. Since  I'm not familiar with  linux kernel
> driver  internals,  may  I  ask  few  questions  in  order  to  better
> understand the  intended use  of ..._delayed_work and  the interaction
> between drivers like the cx23885 and the tuner modules?
>
> - Is si2157_sleep not related to "suspended" system power state but to
>    an "unused" state of the tuner physical device?

It is physical chip state where is is not used.

> - Which  mechanism  should  call  si2157_sleep when  e.g.  cx23885  is
>    rmmod'ed?

It is dvb_frontend. You should not be able to remove cx23885 module 
until dvb frontend is stopped. dvb frontend runs thread in order to call 
init, sleep etc.

> - What  I  understood   so  far  is  that  we  should   not  need  the
>    cance_delayed_work_sync in ..._remove  because ..._sleep should have
>    been called before.  Why is  such a cancel_delayed_work call already
>    in  the  kernel  in  rtl2832_remove although  this  module  has  the
>    separate cancel_delayed_work call in rtl23832_sleep?

Look it again. There is 2 different delayed works.

> - If si2157_sleep  was not called and  the device is considered  to be
>    still   active,  which   mechanism  should   prevent  the   call  of
>    si2157_remove or the unloading of module cx23885?

cx23885 should ref count si2157. I think dvb frontend ref counts cx23885.

> - Is it correct that rmmod cx23885  is legitime if none of its devices
>    is in use by any app and if those "links" between si2157 and cx23885
>    are  shut down  correctly?  ...   and  should only  be prevented  if
>    something goes wrong with shutting down those "links"?
>
> - Is the  interpretation correct  that si2157_remove should  be called
>    when the  driver/module si2157  is unloaded?  If yes  - how  does it
>    happen that it gets called when only an rmmod cx23885 is done?

It should be called. cx23885 module releases si2157, which makes driver 
core to call si2157_remove in order to free all resources driver has.

There is 2 different kind of driver ops: one for DVB core and one for 
(I2C) driver core. si2157_remove is driver core op and si2157_sleep is 
coming from DVB core.

> - What    is   the    intended    call   sequence    of   all    those
>    sleep/remove/... callbacks when
>
>    a) an applications closes the device (but driver remains loaded, but
>       presumingly inactive)
>
>    b) the  one or other  module is  rmmod'ed (for me  personally mainly
>       cx23885)

Those dvb core callbacks are higher level and ones from struct 
i2c_driver are lower level. So in order to shutdown device dvb callbacks 
should be called first and after that i2c remove. After I2C remove none 
of callbacks are not valid anymore.

>
> ...  you might  see that  for me  it's mainly  unclear how  si2157 and
> cx23885 are linked in terms of callbacks.
>
> Is there anything  I could try (adding debug statements,  etc.) / test
> in order to get more insight what's going wrong?

Sure, just add manually some printing to see where code is running. 
That's how I also do.

You should examine when dvb core calls sleep and test it is not possible 
to remove cx23885 module until that. And if sleep is not called, why it 
is not called. IIRC there is some module param to prevent dvb frontend 
shutdown, but it should not be used (I think it is left there to 
workaround some bug).

Why you even want to remove whole module? Usually you only remove 
modules manually when you are developing the driver.

regards
Antti

>
> BR and many thanks for your help,
>    Martin
>
>
> On 10.01.2016 05:43, Antti Palosaari wrote:
>> Hello
>> Those timers are activated on frontend init() ja de-activated on sleep().
>> Removing active driver does not sound good and IMHO it should not be even
>> possible. I think it should be find out why it is possible to remove whole
>> driver before it is put to sleep().
>>
>> Antti
>>
>>
>> On 01/09/2016 11:42 PM, Ernst Martin Witte wrote:
>>> Hi again,
>>>
>>> seems  that the  cause  for the  kernel  panic is  a  missing call  to
>>> cancel_delayed_work_sync in  si2157_remove before  the call  to kfree.
>>> After adding cancel_delayed_work_sync(&dev->stat_work), rmmod does not
>>> trigger the kernel panic any more.
>>>
>>> However, very similar issues could be identified also in other modules:
>>>
>>>     ts2020
>>>     af9013
>>>     af9033
>>>     rtl2830
>>>
>>> when looking in drivers/media/tuners and drivers/media/dvb-frontends.
>>>
>>> Therefore,  the submitted  patch  set contains  fixes  also for  those
>>> modules. The submitted patch set is:
>>>
>>>     [PATCH 0/5] [media] cancel_delayed_work_sync before device removal / kfree
>>>
>>> I hope these patches completely fix the issue and are ok for inclusion
>>> in the kernel.
>>>
>>> BR and thx,
>>>     Martin
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-media" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> http://palosaari.fi/
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-media" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
http://palosaari.fi/

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-01-11  2:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-08 12:18 [REGRESSION/bisected] kernel panic after "rmmod cx23885" by "si2157: implement signal strength stats" Martin Witte
2016-01-09 21:42 ` Ernst Martin Witte
2016-01-10  3:43   ` Antti Palosaari
2016-01-10 22:04     ` Ernst Martin Witte
2016-01-11  2:34       ` Antti Palosaari

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.