All of lore.kernel.org
 help / color / mirror / Atom feed
* Recursive/circular locking in serial8250_console_write/serial8250_do_startup
@ 2020-08-12 15:48 Guenter Roeck
  2020-08-13  5:06 ` Guenter Roeck
  0 siblings, 1 reply; 9+ messages in thread
From: Guenter Roeck @ 2020-08-12 15:48 UTC (permalink / raw)
  To: gregkh; +Cc: linux-serial, linux-kernel

Hi,

crbug.com/1114800 reports a hard lockup due to circular locking in the
8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.

Problem is as follows:
- serial8250_do_startup() locks the serial (console) port.
- serial8250_do_startup() then disables interrupts if interrupts are
  shared, by calling disable_irq_nosync().
- disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
  descriptor.
- __irq_get_desc_lock() calls lock_acquire()
- If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
  are called and identify a potential locking error.
- This locking error is reported via printk, which ultimately calls
  serial8250_console_write().
- serial8250_console_write() tries to lock the serial console port.
  Since it is already locked, the system hangs and ultimately reports
  a hard lockup.

I understand we'll need to figure out and fix what lockdep complains about,
and I am working on that. However, even if that is fixed, we'll need a
solution for the recursive lock: Fixing the lockdep problem doesn't
guarantee that a similar problem (or some other log message) won't be
detected and reported sometime in the future while serial8250_do_startup()
holds the console port lock.

Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-12 15:48 Recursive/circular locking in serial8250_console_write/serial8250_do_startup Guenter Roeck
@ 2020-08-13  5:06 ` Guenter Roeck
  2020-08-13 11:59   ` Greg KH
  0 siblings, 1 reply; 9+ messages in thread
From: Guenter Roeck @ 2020-08-13  5:06 UTC (permalink / raw)
  To: gregkh; +Cc: linux-serial, linux-kernel

On Wed, Aug 12, 2020 at 08:48:13AM -0700, Guenter Roeck wrote:
> Hi,
> 
> crbug.com/1114800 reports a hard lockup due to circular locking in the
> 8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.
> 
> Problem is as follows:
> - serial8250_do_startup() locks the serial (console) port.
> - serial8250_do_startup() then disables interrupts if interrupts are
>   shared, by calling disable_irq_nosync().
> - disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
>   descriptor.
> - __irq_get_desc_lock() calls lock_acquire()
> - If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
>   are called and identify a potential locking error.
> - This locking error is reported via printk, which ultimately calls
>   serial8250_console_write().
> - serial8250_console_write() tries to lock the serial console port.
>   Since it is already locked, the system hangs and ultimately reports
>   a hard lockup.
> 
> I understand we'll need to figure out and fix what lockdep complains about,
> and I am working on that. However, even if that is fixed, we'll need a
> solution for the recursive lock: Fixing the lockdep problem doesn't
> guarantee that a similar problem (or some other log message) won't be
> detected and reported sometime in the future while serial8250_do_startup()
> holds the console port lock.
> 
> Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.
> 

Turns out the situation is a bit worse than I thought. disable_irq_nosync(),
when called from serial8250_do_startup(), locks the interrupt descriptor.
The order of locking is
	serial port lock
	  interrupt descriptor lock

At the same time, __setup_irq() locks the interrupt descriptor as well.
With the descriptor locked, it may report an error message using pr_err().
This in turn may call serial8250_console_write(), which will try to lock
the console serial port. The lock sequence is
	interrupt descriptor lock
	  serial port lock

I added the lockdep splat to the bug log at crbug.com/1114800.

Effectively, I think, this means we can't call disable_irq_nosync()
while holding a serial port lock, or at least not while holding a
serial port lock that is associated with a console.

The problem was introduced (or, rather, exposed) with upstream commit
7febbcbc48fc ("serial: 8250: Check UPF_IRQ_SHARED in advance").

Guenter

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-13  5:06 ` Guenter Roeck
@ 2020-08-13 11:59   ` Greg KH
  2020-08-13 14:20     ` Andy Shevchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Greg KH @ 2020-08-13 11:59 UTC (permalink / raw)
  To: Andy Shevchenko, Guenter Roeck; +Cc: linux-serial, linux-kernel

On Wed, Aug 12, 2020 at 10:06:29PM -0700, Guenter Roeck wrote:
> On Wed, Aug 12, 2020 at 08:48:13AM -0700, Guenter Roeck wrote:
> > Hi,
> > 
> > crbug.com/1114800 reports a hard lockup due to circular locking in the
> > 8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.
> > 
> > Problem is as follows:
> > - serial8250_do_startup() locks the serial (console) port.
> > - serial8250_do_startup() then disables interrupts if interrupts are
> >   shared, by calling disable_irq_nosync().
> > - disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
> >   descriptor.
> > - __irq_get_desc_lock() calls lock_acquire()
> > - If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
> >   are called and identify a potential locking error.
> > - This locking error is reported via printk, which ultimately calls
> >   serial8250_console_write().
> > - serial8250_console_write() tries to lock the serial console port.
> >   Since it is already locked, the system hangs and ultimately reports
> >   a hard lockup.
> > 
> > I understand we'll need to figure out and fix what lockdep complains about,
> > and I am working on that. However, even if that is fixed, we'll need a
> > solution for the recursive lock: Fixing the lockdep problem doesn't
> > guarantee that a similar problem (or some other log message) won't be
> > detected and reported sometime in the future while serial8250_do_startup()
> > holds the console port lock.
> > 
> > Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.
> > 
> 
> Turns out the situation is a bit worse than I thought. disable_irq_nosync(),
> when called from serial8250_do_startup(), locks the interrupt descriptor.
> The order of locking is
> 	serial port lock
> 	  interrupt descriptor lock
> 
> At the same time, __setup_irq() locks the interrupt descriptor as well.
> With the descriptor locked, it may report an error message using pr_err().
> This in turn may call serial8250_console_write(), which will try to lock
> the console serial port. The lock sequence is
> 	interrupt descriptor lock
> 	  serial port lock
> 
> I added the lockdep splat to the bug log at crbug.com/1114800.
> 
> Effectively, I think, this means we can't call disable_irq_nosync()
> while holding a serial port lock, or at least not while holding a
> serial port lock that is associated with a console.
> 
> The problem was introduced (or, rather, exposed) with upstream commit
> 7febbcbc48fc ("serial: 8250: Check UPF_IRQ_SHARED in advance").

Adding Andy, who wrote the above commit :)

Andy, any thoughts?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-13 11:59   ` Greg KH
@ 2020-08-13 14:20     ` Andy Shevchenko
  2020-08-13 15:33       ` Guenter Roeck
  0 siblings, 1 reply; 9+ messages in thread
From: Andy Shevchenko @ 2020-08-13 14:20 UTC (permalink / raw)
  To: Greg KH, Sergey Senozhatsky, Petr Mladek, Steven Rostedt
  Cc: Guenter Roeck, linux-serial, linux-kernel, Tony Lindgren,
	Sergey Senozhatsky, John Ogness, kurt, Raul Rangel, S, Shirish

On Thu, Aug 13, 2020 at 01:59:48PM +0200, Greg KH wrote:
> On Wed, Aug 12, 2020 at 10:06:29PM -0700, Guenter Roeck wrote:
> > On Wed, Aug 12, 2020 at 08:48:13AM -0700, Guenter Roeck wrote:
> > > Hi,
> > > 
> > > crbug.com/1114800 reports a hard lockup due to circular locking in the
> > > 8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.
> > > 
> > > Problem is as follows:
> > > - serial8250_do_startup() locks the serial (console) port.
> > > - serial8250_do_startup() then disables interrupts if interrupts are
> > >   shared, by calling disable_irq_nosync().
> > > - disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
> > >   descriptor.
> > > - __irq_get_desc_lock() calls lock_acquire()
> > > - If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
> > >   are called and identify a potential locking error.
> > > - This locking error is reported via printk, which ultimately calls
> > >   serial8250_console_write().
> > > - serial8250_console_write() tries to lock the serial console port.
> > >   Since it is already locked, the system hangs and ultimately reports
> > >   a hard lockup.
> > > 
> > > I understand we'll need to figure out and fix what lockdep complains about,
> > > and I am working on that. However, even if that is fixed, we'll need a
> > > solution for the recursive lock: Fixing the lockdep problem doesn't
> > > guarantee that a similar problem (or some other log message) won't be
> > > detected and reported sometime in the future while serial8250_do_startup()
> > > holds the console port lock.
> > > 
> > > Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.
> > > 
> > 
> > Turns out the situation is a bit worse than I thought. disable_irq_nosync(),
> > when called from serial8250_do_startup(), locks the interrupt descriptor.
> > The order of locking is
> > 	serial port lock
> > 	  interrupt descriptor lock
> > 
> > At the same time, __setup_irq() locks the interrupt descriptor as well.
> > With the descriptor locked, it may report an error message using pr_err().
> > This in turn may call serial8250_console_write(), which will try to lock
> > the console serial port. The lock sequence is
> > 	interrupt descriptor lock
> > 	  serial port lock
> > 
> > I added the lockdep splat to the bug log at crbug.com/1114800.
> > 
> > Effectively, I think, this means we can't call disable_irq_nosync()
> > while holding a serial port lock, or at least not while holding a
> > serial port lock that is associated with a console.
> > 
> > The problem was introduced (or, rather, exposed) with upstream commit
> > 7febbcbc48fc ("serial: 8250: Check UPF_IRQ_SHARED in advance").
> 
> Adding Andy, who wrote the above commit :)
> 
> Andy, any thoughts?

So, we have here a problem and my commit is indeed revealed it since it's
basically did spread of what we used to have only in two drivers (and
originally reported problem was against third one, i.e. 8250_pnp) to all 8250.
Even if we revert that commit, we got the other problem appear, hence it's a
matter who to suffer until the clean solution will be provided.

As per earlier discussion [1] (and I Cc'ed this to people from there) it
appears there is another issue with RT kernels which brought initially that
controversial disable_irq_nosync() call. Same here, if we drop this call
somebody will be unsatisfied.

The real fix possible to go to completely lockless printk(), but on the other
hand it probably won't prevent other locking corner cases (dead lock?) in 8250
console write callback.

The fix proposed in [2] perhaps not the way to go either...

The idea about not allowing disabling IRQ for console port may be least
painful as a (temporary?) mitigation.

[1]: https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/T/#u
[2]: https://bugs.chromium.org/p/chromium/issues/detail?id=1114800

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-13 14:20     ` Andy Shevchenko
@ 2020-08-13 15:33       ` Guenter Roeck
  2020-08-13 17:03         ` Guenter Roeck
  0 siblings, 1 reply; 9+ messages in thread
From: Guenter Roeck @ 2020-08-13 15:33 UTC (permalink / raw)
  To: Andy Shevchenko, Greg KH, Sergey Senozhatsky, Petr Mladek,
	Steven Rostedt
  Cc: linux-serial, linux-kernel, Tony Lindgren, Sergey Senozhatsky,
	John Ogness, kurt, Raul Rangel, S, Shirish

On 8/13/20 7:20 AM, Andy Shevchenko wrote:
> On Thu, Aug 13, 2020 at 01:59:48PM +0200, Greg KH wrote:
>> On Wed, Aug 12, 2020 at 10:06:29PM -0700, Guenter Roeck wrote:
>>> On Wed, Aug 12, 2020 at 08:48:13AM -0700, Guenter Roeck wrote:
>>>> Hi,
>>>>
>>>> crbug.com/1114800 reports a hard lockup due to circular locking in the
>>>> 8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.
>>>>
>>>> Problem is as follows:
>>>> - serial8250_do_startup() locks the serial (console) port.
>>>> - serial8250_do_startup() then disables interrupts if interrupts are
>>>>   shared, by calling disable_irq_nosync().
>>>> - disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
>>>>   descriptor.
>>>> - __irq_get_desc_lock() calls lock_acquire()
>>>> - If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
>>>>   are called and identify a potential locking error.
>>>> - This locking error is reported via printk, which ultimately calls
>>>>   serial8250_console_write().
>>>> - serial8250_console_write() tries to lock the serial console port.
>>>>   Since it is already locked, the system hangs and ultimately reports
>>>>   a hard lockup.
>>>>
>>>> I understand we'll need to figure out and fix what lockdep complains about,
>>>> and I am working on that. However, even if that is fixed, we'll need a
>>>> solution for the recursive lock: Fixing the lockdep problem doesn't
>>>> guarantee that a similar problem (or some other log message) won't be
>>>> detected and reported sometime in the future while serial8250_do_startup()
>>>> holds the console port lock.
>>>>
>>>> Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.
>>>>
>>>
>>> Turns out the situation is a bit worse than I thought. disable_irq_nosync(),
>>> when called from serial8250_do_startup(), locks the interrupt descriptor.
>>> The order of locking is
>>> 	serial port lock
>>> 	  interrupt descriptor lock
>>>
>>> At the same time, __setup_irq() locks the interrupt descriptor as well.
>>> With the descriptor locked, it may report an error message using pr_err().
>>> This in turn may call serial8250_console_write(), which will try to lock
>>> the console serial port. The lock sequence is
>>> 	interrupt descriptor lock
>>> 	  serial port lock
>>>
>>> I added the lockdep splat to the bug log at crbug.com/1114800.
>>>
>>> Effectively, I think, this means we can't call disable_irq_nosync()
>>> while holding a serial port lock, or at least not while holding a
>>> serial port lock that is associated with a console.
>>>
>>> The problem was introduced (or, rather, exposed) with upstream commit
>>> 7febbcbc48fc ("serial: 8250: Check UPF_IRQ_SHARED in advance").
>>
>> Adding Andy, who wrote the above commit :)
>>
>> Andy, any thoughts?
> 
> So, we have here a problem and my commit is indeed revealed it since it's
> basically did spread of what we used to have only in two drivers (and
> originally reported problem was against third one, i.e. 8250_pnp) to all 8250.
> Even if we revert that commit, we got the other problem appear, hence it's a
> matter who to suffer until the clean solution will be provided.
> 
> As per earlier discussion [1] (and I Cc'ed this to people from there) it
> appears there is another issue with RT kernels which brought initially that
> controversial disable_irq_nosync() call. Same here, if we drop this call
> somebody will be unsatisfied.
> 

The lock chain in [1] is
	console_owner --> &port_lock_key --> &irq_desc_lock_class
and in my case ([2], comment 13) it is
	&irq_desc_lock_class --> console_owner --> &port_lock_key

Looks like we just changed the order of locks.

> The real fix possible to go to completely lockless printk(), but on the other
> hand it probably won't prevent other locking corner cases (dead lock?) in 8250
> console write callback.
> spin_lock_irqsave
> The fix proposed in [2] perhaps not the way to go either...
> 

The code in [2] (Comment 10) was just a hack to prevent the hard lockup
from happening and to be able see the lockdep splat. It wasn't supposed
to fix anything (and it doesn't).

Can we change the order of spin_lock_irqsave / disable_irq_nosync ?

-               spin_lock_irqsave(&port->lock, flags);
                if (up->port.irqflags & IRQF_SHARED)
                        disable_irq_nosync(port->irq);
+               spin_lock_irqsave(&port->lock, flags);

[ plus of course the same for unlock ]

Thanks,
Guenter

> The idea about not allowing disabling IRQ for console port may be least
> painful as a (temporary?) mitigation.
> 
> [1]: https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/T/#u
> [2]: https://bugs.chromium.org/p/chromium/issues/detail?id=1114800
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-13 15:33       ` Guenter Roeck
@ 2020-08-13 17:03         ` Guenter Roeck
  2020-08-13 18:00           ` Andy Shevchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Guenter Roeck @ 2020-08-13 17:03 UTC (permalink / raw)
  To: Andy Shevchenko, Greg KH, Sergey Senozhatsky, Petr Mladek,
	Steven Rostedt
  Cc: linux-serial, linux-kernel, Tony Lindgren, Sergey Senozhatsky,
	John Ogness, kurt, Raul Rangel, S, Shirish

On 8/13/20 8:33 AM, Guenter Roeck wrote:
> On 8/13/20 7:20 AM, Andy Shevchenko wrote:
>> On Thu, Aug 13, 2020 at 01:59:48PM +0200, Greg KH wrote:
>>> On Wed, Aug 12, 2020 at 10:06:29PM -0700, Guenter Roeck wrote:
>>>> On Wed, Aug 12, 2020 at 08:48:13AM -0700, Guenter Roeck wrote:
>>>>> Hi,
>>>>>
>>>>> crbug.com/1114800 reports a hard lockup due to circular locking in the
>>>>> 8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.
>>>>>
>>>>> Problem is as follows:
>>>>> - serial8250_do_startup() locks the serial (console) port.
>>>>> - serial8250_do_startup() then disables interrupts if interrupts are
>>>>>   shared, by calling disable_irq_nosync().
>>>>> - disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
>>>>>   descriptor.
>>>>> - __irq_get_desc_lock() calls lock_acquire()
>>>>> - If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
>>>>>   are called and identify a potential locking error.
>>>>> - This locking error is reported via printk, which ultimately calls
>>>>>   serial8250_console_write().
>>>>> - serial8250_console_write() tries to lock the serial console port.
>>>>>   Since it is already locked, the system hangs and ultimately reports
>>>>>   a hard lockup.
>>>>>
>>>>> I understand we'll need to figure out and fix what lockdep complains about,
>>>>> and I am working on that. However, even if that is fixed, we'll need a
>>>>> solution for the recursive lock: Fixing the lockdep problem doesn't
>>>>> guarantee that a similar problem (or some other log message) won't be
>>>>> detected and reported sometime in the future while serial8250_do_startup()
>>>>> holds the console port lock.
>>>>>
>>>>> Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.
>>>>>
>>>>
>>>> Turns out the situation is a bit worse than I thought. disable_irq_nosync(),
>>>> when called from serial8250_do_startup(), locks the interrupt descriptor.
>>>> The order of locking is
>>>> 	serial port lock
>>>> 	  interrupt descriptor lock
>>>>
>>>> At the same time, __setup_irq() locks the interrupt descriptor as well.
>>>> With the descriptor locked, it may report an error message using pr_err().
>>>> This in turn may call serial8250_console_write(), which will try to lock
>>>> the console serial port. The lock sequence is
>>>> 	interrupt descriptor lock
>>>> 	  serial port lock
>>>>
>>>> I added the lockdep splat to the bug log at crbug.com/1114800.
>>>>
>>>> Effectively, I think, this means we can't call disable_irq_nosync()
>>>> while holding a serial port lock, or at least not while holding a
>>>> serial port lock that is associated with a console.
>>>>
>>>> The problem was introduced (or, rather, exposed) with upstream commit
>>>> 7febbcbc48fc ("serial: 8250: Check UPF_IRQ_SHARED in advance").
>>>
>>> Adding Andy, who wrote the above commit :)
>>>
>>> Andy, any thoughts?
>>
>> So, we have here a problem and my commit is indeed revealed it since it's
>> basically did spread of what we used to have only in two drivers (and
>> originally reported problem was against third one, i.e. 8250_pnp) to all 8250.
>> Even if we revert that commit, we got the other problem appear, hence it's a
>> matter who to suffer until the clean solution will be provided.
>>
>> As per earlier discussion [1] (and I Cc'ed this to people from there) it
>> appears there is another issue with RT kernels which brought initially that
>> controversial disable_irq_nosync() call. Same here, if we drop this call
>> somebody will be unsatisfied.
>>
> 
> The lock chain in [1] is
> 	console_owner --> &port_lock_key --> &irq_desc_lock_class
> and in my case ([2], comment 13) it is
> 	&irq_desc_lock_class --> console_owner --> &port_lock_key
> 
> Looks like we just changed the order of locks.
> 
>> The real fix possible to go to completely lockless printk(), but on the other
>> hand it probably won't prevent other locking corner cases (dead lock?) in 8250
>> console write callback.
>> spin_lock_irqsave
>> The fix proposed in [2] perhaps not the way to go either...
>>
> 
> The code in [2] (Comment 10) was just a hack to prevent the hard lockup
> from happening and to be able see the lockdep splat. It wasn't supposed
> to fix anything (and it doesn't).
> 
> Can we change the order of spin_lock_irqsave / disable_irq_nosync ?
> 
> -               spin_lock_irqsave(&port->lock, flags);
>                 if (up->port.irqflags & IRQF_SHARED)
>                         disable_irq_nosync(port->irq);
> +               spin_lock_irqsave(&port->lock, flags);
> 
> [ plus of course the same for unlock ]
> 

I tried the above, and it solves the problem for me. It is less than perfect
(interrupts will be disabled for more time if the code has to wait for the
spinlock), but I _think_ it should work.

Should I submit a patch ? Or is there a major flaw in my thinking ?

Thanks,
Guenter

> 
>> The idea about not allowing disabling IRQ for console port may be least
>> painful as a (temporary?) mitigation.
>>
>> [1]: https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/T/#u
>> [2]: https://bugs.chromium.org/p/chromium/issues/detail?id=1114800
>>
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-13 17:03         ` Guenter Roeck
@ 2020-08-13 18:00           ` Andy Shevchenko
  2020-08-13 18:38             ` Guenter Roeck
  2020-08-14  1:39             ` Sergey Senozhatsky
  0 siblings, 2 replies; 9+ messages in thread
From: Andy Shevchenko @ 2020-08-13 18:00 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Greg KH, Sergey Senozhatsky, Petr Mladek, Steven Rostedt,
	linux-serial, linux-kernel, Tony Lindgren, Sergey Senozhatsky,
	John Ogness, kurt, Raul Rangel, S, Shirish

On Thu, Aug 13, 2020 at 10:03:44AM -0700, Guenter Roeck wrote:
> On 8/13/20 8:33 AM, Guenter Roeck wrote:
> > On 8/13/20 7:20 AM, Andy Shevchenko wrote:
> >> On Thu, Aug 13, 2020 at 01:59:48PM +0200, Greg KH wrote:
> >>> On Wed, Aug 12, 2020 at 10:06:29PM -0700, Guenter Roeck wrote:
> >>>> On Wed, Aug 12, 2020 at 08:48:13AM -0700, Guenter Roeck wrote:
> >>>>> Hi,
> >>>>>
> >>>>> crbug.com/1114800 reports a hard lockup due to circular locking in the
> >>>>> 8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.
> >>>>>
> >>>>> Problem is as follows:
> >>>>> - serial8250_do_startup() locks the serial (console) port.
> >>>>> - serial8250_do_startup() then disables interrupts if interrupts are
> >>>>>   shared, by calling disable_irq_nosync().
> >>>>> - disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
> >>>>>   descriptor.
> >>>>> - __irq_get_desc_lock() calls lock_acquire()
> >>>>> - If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
> >>>>>   are called and identify a potential locking error.
> >>>>> - This locking error is reported via printk, which ultimately calls
> >>>>>   serial8250_console_write().
> >>>>> - serial8250_console_write() tries to lock the serial console port.
> >>>>>   Since it is already locked, the system hangs and ultimately reports
> >>>>>   a hard lockup.
> >>>>>
> >>>>> I understand we'll need to figure out and fix what lockdep complains about,
> >>>>> and I am working on that. However, even if that is fixed, we'll need a
> >>>>> solution for the recursive lock: Fixing the lockdep problem doesn't
> >>>>> guarantee that a similar problem (or some other log message) won't be
> >>>>> detected and reported sometime in the future while serial8250_do_startup()
> >>>>> holds the console port lock.
> >>>>>
> >>>>> Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.
> >>>>>
> >>>>
> >>>> Turns out the situation is a bit worse than I thought. disable_irq_nosync(),
> >>>> when called from serial8250_do_startup(), locks the interrupt descriptor.
> >>>> The order of locking is
> >>>> 	serial port lock
> >>>> 	  interrupt descriptor lock
> >>>>
> >>>> At the same time, __setup_irq() locks the interrupt descriptor as well.
> >>>> With the descriptor locked, it may report an error message using pr_err().
> >>>> This in turn may call serial8250_console_write(), which will try to lock
> >>>> the console serial port. The lock sequence is
> >>>> 	interrupt descriptor lock
> >>>> 	  serial port lock
> >>>>
> >>>> I added the lockdep splat to the bug log at crbug.com/1114800.
> >>>>
> >>>> Effectively, I think, this means we can't call disable_irq_nosync()
> >>>> while holding a serial port lock, or at least not while holding a
> >>>> serial port lock that is associated with a console.
> >>>>
> >>>> The problem was introduced (or, rather, exposed) with upstream commit
> >>>> 7febbcbc48fc ("serial: 8250: Check UPF_IRQ_SHARED in advance").
> >>>
> >>> Adding Andy, who wrote the above commit :)
> >>>
> >>> Andy, any thoughts?
> >>
> >> So, we have here a problem and my commit is indeed revealed it since it's
> >> basically did spread of what we used to have only in two drivers (and
> >> originally reported problem was against third one, i.e. 8250_pnp) to all 8250.
> >> Even if we revert that commit, we got the other problem appear, hence it's a
> >> matter who to suffer until the clean solution will be provided.
> >>
> >> As per earlier discussion [1] (and I Cc'ed this to people from there) it
> >> appears there is another issue with RT kernels which brought initially that
> >> controversial disable_irq_nosync() call. Same here, if we drop this call
> >> somebody will be unsatisfied.
> >>
> > 
> > The lock chain in [1] is
> > 	console_owner --> &port_lock_key --> &irq_desc_lock_class
> > and in my case ([2], comment 13) it is
> > 	&irq_desc_lock_class --> console_owner --> &port_lock_key
> > 
> > Looks like we just changed the order of locks.
> > 
> >> The real fix possible to go to completely lockless printk(), but on the other
> >> hand it probably won't prevent other locking corner cases (dead lock?) in 8250
> >> console write callback.
> >> spin_lock_irqsave
> >> The fix proposed in [2] perhaps not the way to go either...
> >>
> > 
> > The code in [2] (Comment 10) was just a hack to prevent the hard lockup
> > from happening and to be able see the lockdep splat. It wasn't supposed
> > to fix anything (and it doesn't).
> > 
> > Can we change the order of spin_lock_irqsave / disable_irq_nosync ?
> > 
> > -               spin_lock_irqsave(&port->lock, flags);
> >                 if (up->port.irqflags & IRQF_SHARED)
> >                         disable_irq_nosync(port->irq);
> > +               spin_lock_irqsave(&port->lock, flags);
> > 
> > [ plus of course the same for unlock ]
> > 
> 
> I tried the above, and it solves the problem for me. It is less than perfect
> (interrupts will be disabled for more time if the code has to wait for the
> spinlock), but I _think_ it should work.
> 
> Should I submit a patch ? Or is there a major flaw in my thinking ?

That's basically what Sergey (IIRC) proposed in [1].
(You missed enable_irq() part in above)

I guess we may try it. Sergey, can you submit a formal patch?

> >> The idea about not allowing disabling IRQ for console port may be least
> >> painful as a (temporary?) mitigation.
> >>
> >> [1]: https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/T/#u
> >> [2]: https://bugs.chromium.org/p/chromium/issues/detail?id=1114800

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-13 18:00           ` Andy Shevchenko
@ 2020-08-13 18:38             ` Guenter Roeck
  2020-08-14  1:39             ` Sergey Senozhatsky
  1 sibling, 0 replies; 9+ messages in thread
From: Guenter Roeck @ 2020-08-13 18:38 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Greg KH, Sergey Senozhatsky, Petr Mladek, Steven Rostedt,
	linux-serial, linux-kernel, Tony Lindgren, Sergey Senozhatsky,
	John Ogness, kurt, Raul Rangel, S, Shirish

On Thu, Aug 13, 2020 at 09:00:13PM +0300, Andy Shevchenko wrote:
> On Thu, Aug 13, 2020 at 10:03:44AM -0700, Guenter Roeck wrote:
> > On 8/13/20 8:33 AM, Guenter Roeck wrote:
> > > On 8/13/20 7:20 AM, Andy Shevchenko wrote:
> > >> On Thu, Aug 13, 2020 at 01:59:48PM +0200, Greg KH wrote:
> > >>> On Wed, Aug 12, 2020 at 10:06:29PM -0700, Guenter Roeck wrote:
> > >>>> On Wed, Aug 12, 2020 at 08:48:13AM -0700, Guenter Roeck wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> crbug.com/1114800 reports a hard lockup due to circular locking in the
> > >>>>> 8250 console driver. This is seen if CONFIG_PROVE_LOCKING is enabled.
> > >>>>>
> > >>>>> Problem is as follows:
> > >>>>> - serial8250_do_startup() locks the serial (console) port.
> > >>>>> - serial8250_do_startup() then disables interrupts if interrupts are
> > >>>>>   shared, by calling disable_irq_nosync().
> > >>>>> - disable_irq_nosync() calls __irq_get_desc_lock() to lock the interrupt
> > >>>>>   descriptor.
> > >>>>> - __irq_get_desc_lock() calls lock_acquire()
> > >>>>> - If CONFIG_PROVE_LOCKING is enabled, validate_chain() and check_noncircular()
> > >>>>>   are called and identify a potential locking error.
> > >>>>> - This locking error is reported via printk, which ultimately calls
> > >>>>>   serial8250_console_write().
> > >>>>> - serial8250_console_write() tries to lock the serial console port.
> > >>>>>   Since it is already locked, the system hangs and ultimately reports
> > >>>>>   a hard lockup.
> > >>>>>
> > >>>>> I understand we'll need to figure out and fix what lockdep complains about,
> > >>>>> and I am working on that. However, even if that is fixed, we'll need a
> > >>>>> solution for the recursive lock: Fixing the lockdep problem doesn't
> > >>>>> guarantee that a similar problem (or some other log message) won't be
> > >>>>> detected and reported sometime in the future while serial8250_do_startup()
> > >>>>> holds the console port lock.
> > >>>>>
> > >>>>> Ideas, anyone ? Everything I came up with so far seems clumsy and hackish.
> > >>>>>
> > >>>>
> > >>>> Turns out the situation is a bit worse than I thought. disable_irq_nosync(),
> > >>>> when called from serial8250_do_startup(), locks the interrupt descriptor.
> > >>>> The order of locking is
> > >>>> 	serial port lock
> > >>>> 	  interrupt descriptor lock
> > >>>>
> > >>>> At the same time, __setup_irq() locks the interrupt descriptor as well.
> > >>>> With the descriptor locked, it may report an error message using pr_err().
> > >>>> This in turn may call serial8250_console_write(), which will try to lock
> > >>>> the console serial port. The lock sequence is
> > >>>> 	interrupt descriptor lock
> > >>>> 	  serial port lock
> > >>>>
> > >>>> I added the lockdep splat to the bug log at crbug.com/1114800.
> > >>>>
> > >>>> Effectively, I think, this means we can't call disable_irq_nosync()
> > >>>> while holding a serial port lock, or at least not while holding a
> > >>>> serial port lock that is associated with a console.
> > >>>>
> > >>>> The problem was introduced (or, rather, exposed) with upstream commit
> > >>>> 7febbcbc48fc ("serial: 8250: Check UPF_IRQ_SHARED in advance").
> > >>>
> > >>> Adding Andy, who wrote the above commit :)
> > >>>
> > >>> Andy, any thoughts?
> > >>
> > >> So, we have here a problem and my commit is indeed revealed it since it's
> > >> basically did spread of what we used to have only in two drivers (and
> > >> originally reported problem was against third one, i.e. 8250_pnp) to all 8250.
> > >> Even if we revert that commit, we got the other problem appear, hence it's a
> > >> matter who to suffer until the clean solution will be provided.
> > >>
> > >> As per earlier discussion [1] (and I Cc'ed this to people from there) it
> > >> appears there is another issue with RT kernels which brought initially that
> > >> controversial disable_irq_nosync() call. Same here, if we drop this call
> > >> somebody will be unsatisfied.
> > >>
> > > 
> > > The lock chain in [1] is
> > > 	console_owner --> &port_lock_key --> &irq_desc_lock_class
> > > and in my case ([2], comment 13) it is
> > > 	&irq_desc_lock_class --> console_owner --> &port_lock_key
> > > 
> > > Looks like we just changed the order of locks.
> > > 
> > >> The real fix possible to go to completely lockless printk(), but on the other
> > >> hand it probably won't prevent other locking corner cases (dead lock?) in 8250
> > >> console write callback.
> > >> spin_lock_irqsave
> > >> The fix proposed in [2] perhaps not the way to go either...
> > >>
> > > 
> > > The code in [2] (Comment 10) was just a hack to prevent the hard lockup
> > > from happening and to be able see the lockdep splat. It wasn't supposed
> > > to fix anything (and it doesn't).
> > > 
> > > Can we change the order of spin_lock_irqsave / disable_irq_nosync ?
> > > 
> > > -               spin_lock_irqsave(&port->lock, flags);
> > >                 if (up->port.irqflags & IRQF_SHARED)
> > >                         disable_irq_nosync(port->irq);
> > > +               spin_lock_irqsave(&port->lock, flags);
> > > 
> > > [ plus of course the same for unlock ]
> > > 
> > 
> > I tried the above, and it solves the problem for me. It is less than perfect
> > (interrupts will be disabled for more time if the code has to wait for the
> > spinlock), but I _think_ it should work.
> > 
> > Should I submit a patch ? Or is there a major flaw in my thinking ?
> 
> That's basically what Sergey (IIRC) proposed in [1].
> (You missed enable_irq() part in above)

You are correct; I had missed that. Anyway, I was just lazy:

"[ plus of course the same for unlock ]"

See [3] for the real thing.

> 
> I guess we may try it. Sergey, can you submit a formal patch?
> 
Please Cc: me on it.

Thanks,
Guenter

> > >> The idea about not allowing disabling IRQ for console port may be least
> > >> painful as a (temporary?) mitigation.
> > >>
> > >> [1]: https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/T/#u
> > >> [2]: https://bugs.chromium.org/p/chromium/issues/detail?id=1114800
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 
[3] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2354478

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Recursive/circular locking in serial8250_console_write/serial8250_do_startup
  2020-08-13 18:00           ` Andy Shevchenko
  2020-08-13 18:38             ` Guenter Roeck
@ 2020-08-14  1:39             ` Sergey Senozhatsky
  1 sibling, 0 replies; 9+ messages in thread
From: Sergey Senozhatsky @ 2020-08-14  1:39 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Guenter Roeck, Greg KH, Sergey Senozhatsky, Petr Mladek,
	Steven Rostedt, linux-serial, linux-kernel, Tony Lindgren,
	Sergey Senozhatsky, John Ogness, kurt, Raul Rangel, S, Shirish

Hi Andy,

On (20/08/13 21:00), Andy Shevchenko wrote:
> That's basically what Sergey (IIRC) proposed in [1].
> (You missed enable_irq() part in above)
> 
> I guess we may try it. Sergey, can you submit a formal patch?

OK, let me try.

	-ss

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-08-14  1:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-12 15:48 Recursive/circular locking in serial8250_console_write/serial8250_do_startup Guenter Roeck
2020-08-13  5:06 ` Guenter Roeck
2020-08-13 11:59   ` Greg KH
2020-08-13 14:20     ` Andy Shevchenko
2020-08-13 15:33       ` Guenter Roeck
2020-08-13 17:03         ` Guenter Roeck
2020-08-13 18:00           ` Andy Shevchenko
2020-08-13 18:38             ` Guenter Roeck
2020-08-14  1:39             ` Sergey Senozhatsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.