All of lore.kernel.org
 help / color / mirror / Atom feed
* open() on /dev/tty takes 30 seconds on 2.6.36
@ 2010-10-30  7:47 Eli Billauer
  2010-10-30 18:46 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Billauer @ 2010-10-30  7:47 UTC (permalink / raw)
  To: linux-kernel

Hello,

I'm sorry about my previous overconclusive email, but there is a real 
problem with opening TTYs during a few minutes after a system boots 
(Fedora 12 in my case). I'll stick to the facts this time.

The kernel involved is 2.6.36, downloaded a few days ago as "latest 
stable kernel" at kernel.org.

Running strace on sshd with -T and -tt flags, hence showing the time 
each call took, these two lines were found. The number in brackets is 
the time the system call took (not to the time to the next call or 
something).

...
21:35:21.131436 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
device or address) <30.022532>
...
21:35:51.175642 open("/dev/pts/0", O_RDWR) = 4 <30.063213>
...

So it took 30 seconds just to fail opening /dev/tty.

I then went on to add printk's in pty.c. Among others, I had:

static int ptmx_open(struct inode *inode, struct file *filp)
{
    struct tty_struct *tty;
    int retval;
    int index;

    nonseekable_open(inode, filp);

    /* find a device that is not in use. */
    printk(KERN_ALERT  "34: ptmx_open to lock\n");
    tty_lock();
    printk(KERN_ALERT  "35: ptmx_open locked\n");

[snipped here]

And then found in my /var/log/messages (no log lines between these two):
Oct 29 16:14:58 ocho kernel: 34: ptmx_open to lock
Oct 29 16:15:13 ocho kernel: 35: ptmx_open locked

So we can see it took 15 seconds to acquire a lock in this case. In all 
other pairs of lock messages there was no time difference. To me it 
looks like 15 seconds in order to acquire a lock in the kernel is a 
smoking gun.

Please look into this.
   Eli

-- 
Web: http://www.billauer.co.il


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-30  7:47 open() on /dev/tty takes 30 seconds on 2.6.36 Eli Billauer
@ 2010-10-30 18:46 ` Andrew Morton
  2010-10-30 20:46   ` Arnd Bergmann
  2010-10-31  0:20   ` Eli Billauer
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2010-10-30 18:46 UTC (permalink / raw)
  To: Eli Billauer; +Cc: linux-kernel

On Sat, 30 Oct 2010 09:47:26 +0200 Eli Billauer <eli@billauer.co.il> wrote:

> Hello,
> 
> I'm sorry about my previous overconclusive email, but there is a real 
> problem with opening TTYs during a few minutes after a system boots 
> (Fedora 12 in my case). I'll stick to the facts this time.
> 
> The kernel involved is 2.6.36, downloaded a few days ago as "latest 
> stable kernel" at kernel.org.
> 
> Running strace on sshd with -T and -tt flags, hence showing the time 
> each call took, these two lines were found. The number in brackets is 
> the time the system call took (not to the time to the next call or 
> something).
> 
> ...
> 21:35:21.131436 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
> device or address) <30.022532>
> ...
> 21:35:51.175642 open("/dev/pts/0", O_RDWR) = 4 <30.063213>
> ...
> 
> So it took 30 seconds just to fail opening /dev/tty.
> 
> I then went on to add printk's in pty.c. Among others, I had:
> 
> static int ptmx_open(struct inode *inode, struct file *filp)
> {
>     struct tty_struct *tty;
>     int retval;
>     int index;
> 
>     nonseekable_open(inode, filp);
> 
>     /* find a device that is not in use. */
>     printk(KERN_ALERT  "34: ptmx_open to lock\n");
>     tty_lock();
>     printk(KERN_ALERT  "35: ptmx_open locked\n");
> 
> [snipped here]
> 
> And then found in my /var/log/messages (no log lines between these two):
> Oct 29 16:14:58 ocho kernel: 34: ptmx_open to lock
> Oct 29 16:15:13 ocho kernel: 35: ptmx_open locked
> 
> So we can see it took 15 seconds to acquire a lock in this case. In all 
> other pairs of lock messages there was no time difference. To me it 
> looks like 15 seconds in order to acquire a lock in the kernel is a 
> smoking gun.
> 

Odd.  Presumably someone else was holding big_tty_mutex for 15 seconds.

We can find out who, with the sysrq-d command if you have the time
please.  You'll need to enable lockdep and magic sysrq in .config. 
Then run `dmesg -n 8' so all messages get printed by the kernel and
then, in the middle of that 15-second delay, do

	echo d > /proc/sysrq-trigger

I'll confess that I've never used sysrq-d and am unsure what the output
looks like.  Fingers crossed!


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-30 18:46 ` Andrew Morton
@ 2010-10-30 20:46   ` Arnd Bergmann
  2010-10-31  0:20   ` Eli Billauer
  1 sibling, 0 replies; 14+ messages in thread
From: Arnd Bergmann @ 2010-10-30 20:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Eli Billauer, linux-kernel

On Saturday 30 October 2010, Andrew Morton wrote:
> > static int ptmx_open(struct inode *inode, struct file *filp)
> > {
> >     struct tty_struct *tty;
> >     int retval;
> >     int index;
> > 
> >     nonseekable_open(inode, filp);
> > 
> >     /* find a device that is not in use. */
> >     printk(KERN_ALERT  "34: ptmx_open to lock\n");
> >     tty_lock();
> >     printk(KERN_ALERT  "35: ptmx_open locked\n");
> > 
> > [snipped here]
> > 
> > And then found in my /var/log/messages (no log lines between these two):
> > Oct 29 16:14:58 ocho kernel: 34: ptmx_open to lock
> > Oct 29 16:15:13 ocho kernel: 35: ptmx_open locked
> > 
> > So we can see it took 15 seconds to acquire a lock in this case. In all 
> > other pairs of lock messages there was no time difference. To me it 
> > looks like 15 seconds in order to acquire a lock in the kernel is a 
> > smoking gun.

Agreed. When I changed the locking to use a mutex instead of the BKL,
I intentionally left it locked across sleeping functions, but none of
them are supposed to sleep for multiple seconds.

> 
> Odd.  Presumably someone else was holding big_tty_mutex for 15 seconds.
> 
> We can find out who, with the sysrq-d command if you have the time
> please.  You'll need to enable lockdep and magic sysrq in .config. 
> Then run `dmesg -n 8' so all messages get printed by the kernel and
> then, in the middle of that 15-second delay, do
> 
>         echo d > /proc/sysrq-trigger
> 
> I'll confess that I've never used sysrq-d and am unsure what the output
> looks like.  Fingers crossed!

With a little luck, lockdep by itself might even be able to find an
inconsistency in the BTM use and warn right away about it. I found
a number of bugs in the TTY code just by running with lockdep in the
mutex version.

	Arnd

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-30 18:46 ` Andrew Morton
  2010-10-30 20:46   ` Arnd Bergmann
@ 2010-10-31  0:20   ` Eli Billauer
  2010-10-31  3:36     ` Arnd Bergmann
  1 sibling, 1 reply; 14+ messages in thread
From: Eli Billauer @ 2010-10-31  0:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:

> Odd.  Presumably someone else was holding big_tty_mutex for 15 seconds.
>
> We can find out who, with the sysrq-d command if you have the time
> please.  You'll need to enable lockdep and magic sysrq in .config. 
> Then run `dmesg -n 8' so all messages get printed by the kernel and
> then, in the middle of that 15-second delay, do
>
> 	echo d > /proc/sysrq-trigger
>
> I'll confess that I've never used sysrq-d and am unsure what the output
> looks like.  Fingers crossed!
>   

First, I'll explain what I did: Since I can't login while this problem 
occurs, I had to modify sshd's service script to start sshd with strace, 
and also kick off a script (called who-is-locking.sh), which creates 
that sysrq call every five seconds. This could also be the place to 
mention, that I got an NULL dereference oops during one of these (IP at 
get_usage_chars+0x0/0x10e, while in the who-is-locking process). But 
that happened after the desired data was logged.

Now, let's look at the interesting point of the strace (of sshd):

01:23:37.156475 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
device or address) <29.999444>
01:24:07.156354 setsid()                = 2336 <0.000167>

The same open() taking 30 seconds, ending at 01:24:07. In 
/var/log/messages (below) we have
Oct 31 01:24:07 ocho modem-manager: (ttyS2) opening serial device...
Oct 31 01:24:07 ocho modem-manager: (ttyS3) closing serial device...

The same things happens again, when ssh opens /dev/pts/0, which takes 30 
seconds again, with the end of operation coinciding with modem-manager 
opening a tty device.

As far as I know, there is only one real serial port on my computer. 
Maybe this snippet from my boot log may be of interest:

Oct 31 01:14:28 ocho kernel: Serial: 8250/16550 driver, 4 ports, IRQ 
sharing enabled
Oct 31 01:14:28 ocho kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is 
a 16550A
Oct 31 01:14:28 ocho kernel: 00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Oct 31 01:14:28 ocho kernel: serial 0000:05:04.0: PCI INT A -> GSI 16 
(level, low) -> IRQ 16
Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS1 at I/O 0xbe20 (irq = 
16) is a 8250
Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS2 at I/O 0xbe28 (irq = 
16) is a 8250
Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS3 at I/O 0xbe40 (irq = 
16) is a 8250
Oct 31 01:14:28 ocho kernel: Couldn't register serial port 0000:05:04.0: -28

Below is /var/log/message dump for the relevant time interval. I can 
submit more if necessary, but I think the picture is pretty clear.

Hope this helps,
    Eli

Oct 31 01:23:59 ocho kernel: =============================================
Oct 31 01:23:59 ocho kernel:
Oct 31 01:24:04 ocho kernel: SysRq : Show Locks Held
Oct 31 01:24:04 ocho kernel:
Oct 31 01:24:04 ocho kernel: Showing all locks held in the system:
Oct 31 01:24:04 ocho kernel: 2 locks held by modem-manager/1835:
Oct 31 01:24:04 ocho kernel: #0:  (big_tty_mutex){+.+.+.}, at: 
[<ffffffff814b9c09>] tty_lock+0x3f/0x4a
Oct 31 01:24:04 ocho kernel: #1:  (&port->mutex){+.+.+.}, at: 
[<ffffffff812fcace>] uart_close+0x82/0x312
Oct 31 01:24:04 ocho kernel: 2 locks held by who-is-locking./1850:
Oct 31 01:24:04 ocho kernel: #0:  (sysrq_key_table_lock){......}, at: 
[<ffffffff812e97de>] __handle_sysrq+0x28/0x15b
Oct 31 01:24:04 ocho kernel: #1:  (tasklist_lock){.?.+..}, at: 
[<ffffffff810798a5>] debug_show_all_locks+0x4c/0x178
Oct 31 01:24:04 ocho kernel: 1 lock held by gpm/1971:
Oct 31 01:24:04 ocho kernel: #0:  (tty_mutex){+.+.+.}, at: 
[<ffffffff812d5903>] tty_open+0x5a/0x4ac
Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2125:
Oct 31 01:24:04 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2126:
Oct 31 01:24:04 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2127:
Oct 31 01:24:04 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2128:
Oct 31 01:24:04 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2129:
Oct 31 01:24:04 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:04 ocho kernel: 2 locks held by sshd/2336:
Oct 31 01:24:04 ocho kernel: #0:  (tty_mutex){+.+.+.}, at: 
[<ffffffff812d5903>] tty_open+0x5a/0x4ac
Oct 31 01:24:04 ocho kernel: #1:  (big_tty_mutex){+.+.+.}, at: 
[<ffffffff814b9c09>] tty_lock+0x3f/0x4a
Oct 31 01:24:04 ocho kernel:
Oct 31 01:24:04 ocho kernel: =============================================
Oct 31 01:24:04 ocho kernel:
Oct 31 01:24:07 ocho modem-manager: (ttyS2) opening serial device...
Oct 31 01:24:07 ocho modem-manager: (ttyS3) closing serial device...
Oct 31 01:24:09 ocho kernel: SysRq : Show Locks Held
Oct 31 01:24:09 ocho kernel:
Oct 31 01:24:09 ocho kernel: Showing all locks held in the system:
Oct 31 01:24:09 ocho kernel: 2 locks held by modem-manager/1835:
Oct 31 01:24:09 ocho kernel: #0:  (big_tty_mutex){+.+.+.}, at: 
[<ffffffff814b9c09>] tty_lock+0x3f/0x4a
Oct 31 01:24:09 ocho kernel: #1:  (&port->mutex){+.+.+.}, at: 
[<ffffffff812fcace>] uart_close+0x82/0x312
Oct 31 01:24:09 ocho kernel: 2 locks held by who-is-locking./1850:
Oct 31 01:24:09 ocho kernel: #0:  (sysrq_key_table_lock){......}, at: 
[<ffffffff812e97de>] __handle_sysrq+0x28/0x15b
Oct 31 01:24:09 ocho kernel: #1:  (tasklist_lock){.?.+..}, at: 
[<ffffffff810798a5>] debug_show_all_locks+0x4c/0x178
Oct 31 01:24:09 ocho kernel: 1 lock held by gpm/1971:
Oct 31 01:24:09 ocho kernel: #0:  (tty_mutex){+.+.+.}, at: 
[<ffffffff812d5903>] tty_open+0x5a/0x4ac
Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2125:
Oct 31 01:24:09 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2126:
Oct 31 01:24:09 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2127:
Oct 31 01:24:09 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2128:
Oct 31 01:24:09 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2129:
Oct 31 01:24:09 ocho kernel: #0:  (&tty->atomic_read_lock){+.+...}, at: 
[<ffffffff812d7648>] n_tty_read+0x281/0x7b4
Oct 31 01:24:09 ocho kernel: 2 locks held by sshd/2336:
Oct 31 01:24:09 ocho kernel: #0:  (tty_mutex){+.+.+.}, at: 
[<ffffffff812d5903>] tty_open+0x5a/0x4ac
Oct 31 01:24:09 ocho kernel: #1:  (big_tty_mutex){+.+.+.}, at: 
[<ffffffff814b9c09>] tty_lock+0x3f/0x4a
Oct 31 01:24:09 ocho kernel:
Oct 31 01:24:09 ocho kernel: =============================================
Oct 31 01:24:09 ocho kernel:

-- 

Web: http://www.billauer.co.il


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-31  0:20   ` Eli Billauer
@ 2010-10-31  3:36     ` Arnd Bergmann
  2010-10-31  6:34       ` James Cloos
  2010-11-01  1:12       ` Eli Billauer
  0 siblings, 2 replies; 14+ messages in thread
From: Arnd Bergmann @ 2010-10-31  3:36 UTC (permalink / raw)
  To: Eli Billauer; +Cc: Andrew Morton, linux-kernel

On Sunday 31 October 2010, Eli Billauer wrote:
> Andrew Morton wrote:
> 
> First, I'll explain what I did: Since I can't login while this problem 
> occurs, I had to modify sshd's service script to start sshd with strace, 
> and also kick off a script (called who-is-locking.sh), which creates 
> that sysrq call every five seconds. This could also be the place to 
> mention, that I got an NULL dereference oops during one of these (IP at 
> get_usage_chars+0x0/0x10e, while in the who-is-locking process). But 
> that happened after the desired data was logged.

An oops report is interesting most of the time, except when you know
that you caused it with your own kernel changes. If you can still
find the oops report, please post it here.

> Oct 31 01:14:28 ocho kernel: Serial: 8250/16550 driver, 4 ports, IRQ 
> sharing enabled
> Oct 31 01:14:28 ocho kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is 
> a 16550A
> Oct 31 01:14:28 ocho kernel: 00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> Oct 31 01:14:28 ocho kernel: serial 0000:05:04.0: PCI INT A -> GSI 16 
> (level, low) -> IRQ 16
> Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS1 at I/O 0xbe20 (irq = 
> 16) is a 8250
> Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS2 at I/O 0xbe28 (irq = 
> 16) is a 8250
> Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS3 at I/O 0xbe40 (irq = 
> 16) is a 8250
> Oct 31 01:14:28 ocho kernel: Couldn't register serial port 0000:05:04.0: -28

For some reason, your kernel finds five ports at boot time, while it's
been configured at build time to support only four.

> Oct 31 01:23:59 ocho kernel: =============================================
> Oct 31 01:23:59 ocho kernel:
> Oct 31 01:24:04 ocho kernel: SysRq : Show Locks Held
> Oct 31 01:24:04 ocho kernel:
> Oct 31 01:24:04 ocho kernel: Showing all locks held in the system:
> Oct 31 01:24:04 ocho kernel: 2 locks held by modem-manager/1835:
> Oct 31 01:24:04 ocho kernel: #0:  (big_tty_mutex){+.+.+.}, at: 
> [<ffffffff814b9c09>] tty_lock+0x3f/0x4a
> Oct 31 01:24:04 ocho kernel: #1:  (&port->mutex){+.+.+.}, at: 
> [<ffffffff812fcace>] uart_close+0x82/0x312

So modem-manager is the process you need to look at, as it holds the
BTM, probably for a long time. It should not do that. First you should
verify that when you don't have modem-manager running, the problem goes
away. This may indeed be related to the misdetection of a serial port.

The other thing to check is what modem-manager is actually doing. You can
find out by looking at /proc/<pid>/stack, where <pid> is the pid of the
modem-manager process, 1835 in your example.

You can also look at <sysrq>+t to get all the stack traces for the running
processes and finding the process in there. 

I think you're closing in on the bug, thanks to your providing exactly the
right parts of information.

	Arnd

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-31  3:36     ` Arnd Bergmann
@ 2010-10-31  6:34       ` James Cloos
  2010-10-31 11:36         ` Arnd Bergmann
  2010-11-01  1:12       ` Eli Billauer
  1 sibling, 1 reply; 14+ messages in thread
From: James Cloos @ 2010-10-31  6:34 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Eli Billauer, Andrew Morton, linux-kernel

[I'm not sure whether this is the same issue, but in case it helps... -JimC]

I was about to prepare a bug report about a similar issue which as hit
my (normally headless) compute node.  It happened once before I first
booted 2.6.36, and once running 2.6.36.  The last two kernels which
I compiled before v2.6.36 were 9c03f16 and bfa88ea; I cannot remember
whether I ever booted 9c03f16.  I may have compiled v2.6.36 before
trying 9c03f16. The box is a Phenom, 64bit kernel and userland.

All of the ptys stopped working, which seems like it might be related to
a change in tty locking.

Everything else worked normally; only pty i/o stopped.  I couldn't log
in, or use any of my open terminals, but pipes, networking, disk i/o,
et al were all OK.

I was able to use 'ssh server dmesg' to get the dmesg; it had this to say:

WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2()
Hardware name: MS-7642
Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp
Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18
Call Trace:
 [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d
 [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2
 [<ffffffff81068453>] worker_thread+0x182/0x19b
 [<ffffffff810682d1>] ? worker_thread+0x0/0x19b
 [<ffffffff8106ba81>] kthread+0x82/0x8a
 [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106b9ff>] ? kthread+0x0/0x8a
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
---[ end trace 756b0818a6415dca ]---

That was followed by a number of task blocked for more than 120 seconds
messages, all due to waiting for pty input or output.  An example trace:

INFO: task gpg:15408 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gpg           D ffff8802507ec380     0 15408  15390 0x00000000
 ffff8802cf24fd88 0000000000000082 ffff880200000001 0000000000014b00
 0000000000014b00 ffff88010a45ae20 ffff8802cf24ffd8 0000000000014b00
 0000000000014b00 0000000000014b00 0000000000014b00 ffff8802cf24ffd8
Call Trace:
 [<ffffffff815ae2fc>] schedule_timeout+0x36/0xe3
 [<ffffffff81047b2b>] ? get_parent_ip+0x11/0x41
 [<ffffffff81047b2b>] ? get_parent_ip+0x11/0x41
 [<ffffffff815b2b6d>] ? sub_preempt_count+0x97/0xaa
 [<ffffffff815ad8a5>] wait_for_common+0xab/0x105
 [<ffffffff8104c1c4>] ? default_wake_function+0x0/0x14
 [<ffffffff81047b2b>] ? get_parent_ip+0x11/0x41
 [<ffffffff81065dd1>] ? __need_more_worker+0x15/0x2c
 [<ffffffff810d39ba>] ? lru_add_drain_per_cpu+0x0/0x10
 [<ffffffff815ad9b7>] wait_for_completion+0x1d/0x1f
 [<ffffffff81066d0d>] flush_work+0x110/0x12e
 [<ffffffff81066777>] ? wq_barrier_func+0x0/0x14
 [<ffffffff81068831>] schedule_on_each_cpu+0xa8/0xd7
 [<ffffffff810d3472>] lru_add_drain_all+0x15/0x17
 [<ffffffff810e7786>] sys_mlock+0x30/0xdf
 [<ffffffff81009cc2>] system_call_fastpath+0x16/0x1b

(I didn't save the dmesg the first time it happened; I wanted to try .36
before filing a bug report, in case it had already been fixed.)

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-31  6:34       ` James Cloos
@ 2010-10-31 11:36         ` Arnd Bergmann
  2010-11-02 10:28           ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Arnd Bergmann @ 2010-10-31 11:36 UTC (permalink / raw)
  To: James Cloos; +Cc: Eli Billauer, Andrew Morton, linux-kernel, Tejun Heo

On Sunday 31 October 2010, James Cloos wrote:
> WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2()
> Hardware name: MS-7642
> Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp
> Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18
> Call Trace:
>  [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d
>  [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c
>  [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2
>  [<ffffffff81068453>] worker_thread+0x182/0x19b
>  [<ffffffff810682d1>] ? worker_thread+0x0/0x19b
>  [<ffffffff8106ba81>] kthread+0x82/0x8a
>  [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
>  [<ffffffff8106b9ff>] ? kthread+0x0/0x8a
>  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
> ---[ end trace 756b0818a6415dca ]---
> 
> That was followed by a number of task blocked for more than 120 seconds
> messages, all due to waiting for pty input or output.  An example trace:

This is triggered by an internal check in the workqueue handling that
Tejun introduced. I guess it might be possible that this is triggered by
a worker thread being blocked on a mutex, but I don't understand
exactly what it is checking for. 

	Arnd


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-31  3:36     ` Arnd Bergmann
  2010-10-31  6:34       ` James Cloos
@ 2010-11-01  1:12       ` Eli Billauer
  2010-11-01 19:39         ` Arnd Bergmann
  1 sibling, 1 reply; 14+ messages in thread
From: Eli Billauer @ 2010-11-01  1:12 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Andrew Morton, linux-kernel

Let me start with the headline: It's close() of a ttySx that takes 30 
seconds, during which the big TTY lock is held. Now to details.


Arnd Bergmann wrote:

> So modem-manager is the process you need to look at, as it holds the
> BTM, probably for a long time. It should not do that. First you should
> verify that when you don't have modem-manager running, the problem goes
> away. This may indeed be related to the misdetection of a serial port.
>   
That test is built-in: When modem-manager stops poking around, 
everything becomes fine again.
> The other thing to check is what modem-manager is actually doing. You can
> find out by looking at /proc/<pid>/stack, where <pid> is the pid of the
> modem-manager process, 1835 in your example.
>   
OK, so this is what I did eventually: I ran strace on sshd and 
modem-manager, and I also added printk's to tell me when tty_lock() is 
invoked, when it actually got the lock, and when tty_unlock() is 
invoked. Pretty messy, and still.

Let's start with sshd's strace. We have a
01:06:20.152787 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
device or address) <30.008153>

And this is what we have in /var/log/messages:

Nov  1 01:06:20 ocho modem-manager: (ttyS2) closing serial device...
Nov  1 01:06:20 ocho kernel: tty_lock() called
Nov  1 01:06:20 ocho kernel: tty_lock() got lock
Nov  1 01:06:20 ocho kernel: tty_lock() called
Nov  1 01:06:50 ocho kernel: tty_unlock() called
Nov  1 01:06:50 ocho kernel: tty_lock() got lock
Nov  1 01:06:50 ocho kernel: tty_unlock() called
Nov  1 01:06:50 ocho kernel: tty_lock() called

So we can see that someone got the lock, and then didn't release it for 
30 seconds, and that someone else tried to get the lock and waited for 
it 30 seconds.

Here comes modem-manager's strace in the relevant time:
01:06:20.096730 sendto(3, "<29>Nov  1 01:06:20 modem-manager: (ttyS2) 
closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000048>
01:06:20.096932 close(22)               = 0 <30.064112>
01:06:50.161255 ioctl(22, SNDCTL_TMR_START or TCSETS, {B9600 opost isig 
icanon echo ...}) = -1 EBADF (Bad file descriptor) <0.000132>

Surprise, surprise. modem-manager waits 30 seconds for a close() 
operation (maybe I should change this thread's subject?). But maybe 
modem-manager is waiting for someone else? Nope. In another part of 
modem-manager's strace we have:
01:05:50.001380 sendto(3, "<29>Nov  1 01:05:49 modem-manager: (ttyS1) 
closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000237>
01:05:50.001765 close(21)               = 0 <30.064420>

Again (as always, actually), modem-manager waits 30 seconds for a 
close(). In /var/log/messages we have:
Nov  1 01:05:50 ocho modem-manager: (ttyS1) closing serial device...
Nov  1 01:05:50 ocho kernel: tty_lock() called
Nov  1 01:05:50 ocho kernel: tty_lock() got lock
Nov  1 01:05:54 ocho kernel: tty_lock() called
Nov  1 01:06:20 ocho kernel: tty_unlock() called

The last lock-related log entry was an unlock 9 seconds earlier. So it's 
clear that modem-manager's close() call got the lock right away, but 
held it for 30 seconds. In the middle we have some other process asking 
for the lock (and is put to wait).

An strace of modem-manager running under vanilla kernel 2.6.35-4 shows 
that close() takes 30 seconds there as well. It just so happens that it 
went unnoticed (maybe because of the changes in the locking schema?).

And in case we suspect that something special modem-manager did, I've 
attached the relevant piece of strace while handling some TTY. I've cut 
the log to cover a full cycle of a single TTY (file descriptor 21), but 
I chose not to omit the calls relevant to other ports, which are handled 
meanwhile.

I should only note that all TTY close() calls take 30 seconds, and that 
all ports get a ioctl() and extra close() after the file descriptor is 
closed, with an EBADF as response.

Trace follows. Hope this helps.
    Eli


01:06:20.067616 open("/dev/ttyS1", O_RDWR|O_EXCL|O_NOCTTY|O_NONBLOCK) = 
21 <0.020836>
01:06:20.088688 ioctl(21, TIOCEXCL, 0)  = 0 <0.000043>
01:06:20.088873 ioctl(21, TCFLSH, 0x2)  = 0 <0.000096>
01:06:20.089106 ioctl(21, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost 
isig -icanon -echo ...}) = 0 <0.000098>
01:06:20.089377 ioctl(21, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost 
isig -icanon -echo ...}) = 0 <0.000092>
01:06:20.089699 ioctl(21, SNDCTL_TMR_START or TCSETS, {B115200 -opost 
-isig -icanon -echo ...}) = 0 <0.000072>
01:06:20.089909 ioctl(21, SNDCTL_TMR_TIMEBASE or TCGETS, {B115200 -opost 
-isig -icanon -echo ...}) = 0 <0.000056>
01:06:20.090088 fstat(21, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 65), 
...}) = 0 <0.000091>
01:06:20.090294 fcntl(21, F_GETFL)      = 0x8802 (flags 
O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000088>
01:06:20.090515 poll([{fd=22, events=POLLIN}, {fd=23, events=POLLIN}, 
{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, 
{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, 
{fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, 
{fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, 
{fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, 
{fd=5, events=POLLIN}, {fd=21, events=POLLIN}], 19, 0) = 0 (Timeout) 
<0.000034>
01:06:20.090752 poll([{fd=22, events=POLLIN}, {fd=23, events=POLLIN}, 
{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, 
{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, 
{fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, 
{fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, 
{fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, 
{fd=5, events=POLLIN}, {fd=21, events=POLLIN}], 19, 0) = 0 (Timeout) 
<0.000032>
01:06:20.090916 write(21, "\0", 1)      = 1 <0.000028>
01:06:20.091012 nanosleep({0, 1000000}, NULL) = 0 <0.001152>
01:06:20.092246 write(21, "x", 1)       = 1 <0.000027>
01:06:20.092425 nanosleep({0, 1000000}, NULL) = 0 <0.001090>
01:06:20.093690 write(21, "\360", 1)    = 1 <0.000093>
01:06:20.093950 nanosleep({0, 1000000}, NULL) = 0 <0.001154>
01:06:20.095203 write(21, "~", 1)       = 1 <0.000094>
01:06:20.095389 nanosleep({0, 1000000}, NULL) = 0 <0.001144>
01:06:20.096730 sendto(3, "<29>Nov  1 01:06:20 modem-manager: (ttyS2) 
closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000048>
01:06:20.096932 close(22)               = 0 <30.064112>
01:06:50.161255 ioctl(22, SNDCTL_TMR_START or TCSETS, {B9600 opost isig 
icanon echo ...}) = -1 EBADF (Bad file descriptor) <0.000132>
01:06:50.161623 close(22)               = -1 EBADF (Bad file descriptor) 
<0.000121>
01:06:50.162061 sendto(3, "<29>Nov  1 01:06:50 modem-manager: (ttyS2) 
opening serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000165>
01:06:50.162465 open("/dev/ttyS2", O_RDWR|O_EXCL|O_NOCTTY|O_NONBLOCK) = 
22 <0.020821>
01:06:50.183512 ioctl(22, TIOCEXCL, 0)  = 0 <0.000163>
01:06:50.183894 ioctl(22, TCFLSH, 0x2)  = 0 <0.000102>
01:06:50.184121 ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost 
isig -icanon -echo ...}) = 0 <0.000090>
01:06:50.184374 ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost 
isig -icanon -echo ...}) = 0 <0.000092>
01:06:50.184631 ioctl(22, SNDCTL_TMR_START or TCSETS, {B115200 -opost 
-isig -icanon -echo ...}) = 0 <0.000050>
01:06:50.184768 ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, {B115200 -opost 
-isig -icanon -echo ...}) = 0 <0.000074>
01:06:50.184934 fstat(22, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 66), 
...}) = 0 <0.000020>
01:06:50.185040 fcntl(22, F_GETFL)      = 0x8802 (flags 
O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000020>
01:06:50.185237 sendto(3, "<29>Nov  1 01:06:50 modem-manager: (ttyS3) 
closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000026>
01:06:50.185358 close(23)               = 0 <30.063507>
01:07:20.248957 ioctl(23, SNDCTL_TMR_START or TCSETS, {B9600 opost isig 
icanon echo ...}) = -1 EBADF (Bad file descriptor) <0.000025>
01:07:20.249074 close(23)               = -1 EBADF (Bad file descriptor) 
<0.000020>
01:07:20.249242 sendto(3, "<29>Nov  1 01:07:20 modem-manager: (ttyS3) 
opening serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000050>
01:07:20.249397 open("/dev/ttyS3", O_RDWR|O_EXCL|O_NOCTTY|O_NONBLOCK) = 
23 <0.021043>
01:07:20.270531 ioctl(23, TIOCEXCL, 0)  = 0 <0.000031>
01:07:20.270628 ioctl(23, TCFLSH, 0x2)  = 0 <0.000104>
01:07:20.270816 ioctl(23, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost 
isig -icanon -echo ...}) = 0 <0.000029>
01:07:20.270940 ioctl(23, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost 
isig -icanon -echo ...}) = 0 <0.000023>
01:07:20.271039 ioctl(23, SNDCTL_TMR_START or TCSETS, {B115200 -opost 
-isig -icanon -echo ...}) = 0 <0.000032>
01:07:20.271147 ioctl(23, SNDCTL_TMR_TIMEBASE or TCGETS, {B115200 -opost 
-isig -icanon -echo ...}) = 0 <0.000023>
01:07:20.271248 fstat(23, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 67), 
...}) = 0 <0.000022>
01:07:20.271362 fcntl(23, F_GETFL)      = 0x8802 (flags 
O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000021>
01:07:20.271494 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, 
{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, 
{fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, 
{fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, 
{fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, 
{fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 0) = 0 (Timeout) 
<0.000031>
01:07:20.271666 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, 
{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, 
{fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, 
{fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, 
{fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, 
{fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 0) = 0 (Timeout) 
<0.000042>
01:07:20.271868 write(22, "\0", 1)      = 1 <0.000040>
01:07:20.271983 nanosleep({0, 1000000}, NULL) = 0 <0.001189>
01:07:20.273259 write(22, "x", 1)       = 1 <0.000041>
01:07:20.273378 nanosleep({0, 1000000}, NULL) = 0 <0.001188>
01:07:20.274653 write(22, "\360", 1)    = 1 <0.000036>
01:07:20.274807 nanosleep({0, 1000000}, NULL) = 0 <0.001193>
01:07:20.276086 write(22, "~", 1)       = 1 <0.000035>
01:07:20.276213 nanosleep({0, 1000000}, NULL) = 0 <0.001190>
01:07:20.277509 write(23, "\0", 1)      = 1 <0.000039>
01:07:20.277626 nanosleep({0, 1000000}, NULL) = 0 <0.001189>
01:07:20.278900 write(23, "x", 1)       = 1 <0.000035>
01:07:20.279011 nanosleep({0, 1000000}, NULL) = 0 <0.001192>
01:07:20.280288 write(23, "\360", 1)    = 1 <0.000039>
01:07:20.280404 nanosleep({0, 1000000}, NULL) = 0 <0.001193>
01:07:20.281695 write(23, "~", 1)       = 1 <0.000039>
01:07:20.281814 nanosleep({0, 1000000}, NULL) = 0 <0.001191>
01:07:20.283113 write(21, "\0", 1)      = 1 <0.000039>
01:07:20.283232 nanosleep({0, 1000000}, NULL) = 0 <0.001190>
01:07:20.284506 write(21, "x", 1)       = 1 <0.000034>
01:07:20.284615 nanosleep({0, 1000000}, NULL) = 0 <0.001189>
01:07:20.285888 write(21, "\360", 1)    = 1 <0.000040>
01:07:20.286005 nanosleep({0, 1000000}, NULL) = 0 <0.001186>
01:07:20.287276 write(21, "~", 1)       = 1 <0.000040>
01:07:20.287392 nanosleep({0, 1000000}, NULL) = 0 <0.001186>
01:07:20.288690 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, 
{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, 
{fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, 
{fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, 
{fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, 
{fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 3000) = 0 (Timeout) 
<3.003405>
01:07:23.292250 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, 
{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, 
{fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, 
{fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, 
{fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, 
{fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 707) = 0 (Timeout) 
<0.707898>
01:07:24.000331 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, 
{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, 
{fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, 
{fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, 
{fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, 
{fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 0) = 0 (Timeout) 
<0.000044>
01:07:24.000506 write(22, "\0", 1)      = 1 <0.000038>
01:07:24.000620 nanosleep({0, 1000000}, NULL) = 0 <0.001209>
01:07:24.001914 write(22, "x", 1)       = 1 <0.000039>
01:07:24.002028 nanosleep({0, 1000000}, NULL) = 0 <0.001208>
01:07:24.003320 write(22, "\360", 1)    = 1 <0.000035>
01:07:24.003434 nanosleep({0, 1000000}, NULL) = 0 <0.001207>
01:07:24.004743 write(22, "~", 1)       = 1 <0.000036>
01:07:24.004867 nanosleep({0, 1000000}, NULL) = 0 <0.001189>
01:07:24.006171 write(23, "\0", 1)      = 1 <0.000039>
01:07:24.006300 nanosleep({0, 1000000}, NULL) = 0 <0.001177>
01:07:24.007571 write(23, "x", 1)       = 1 <0.000034>
01:07:24.007696 nanosleep({0, 1000000}, NULL) = 0 <0.001189>
01:07:24.008988 write(23, "\360", 1)    = 1 <0.000034>
01:07:24.009113 nanosleep({0, 1000000}, NULL) = 0 <0.001191>
01:07:24.010402 write(23, "~", 1)       = 1 <0.000034>
01:07:24.010526 nanosleep({0, 1000000}, NULL) = 0 <0.001192>
01:07:24.011859 sendto(3, "<29>Nov  1 01:07:24 modem-manager: (ttyS1) 
closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000084>
01:07:24.012044 close(21)               = 0 <30.063920>
01:07:54.076064 ioctl(21, SNDCTL_TMR_START or TCSETS, {B57600 -opost 
isig -icanon -echo ...}) = -1 EBADF (Bad file descriptor) <0.000021>
01:07:54.076186 close(21)               = -1 EBADF (Bad file descriptor) 
<0.000019>

-- 
Web: http://www.billauer.co.il


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-11-01  1:12       ` Eli Billauer
@ 2010-11-01 19:39         ` Arnd Bergmann
  2010-11-01 20:46           ` Alan Cox
  2010-11-03  0:15           ` Eli Billauer
  0 siblings, 2 replies; 14+ messages in thread
From: Arnd Bergmann @ 2010-11-01 19:39 UTC (permalink / raw)
  To: Eli Billauer; +Cc: Andrew Morton, linux-kernel, Greg KH, Alan Cox

On Monday 01 November 2010, Eli Billauer wrote:
> Let me start with the headline: It's close() of a ttySx that takes 30 
> seconds, during which the big TTY lock is held. Now to details.

Ok, good to hear you tracked it down this far!

> Arnd Bergmann wrote:
> 
> > So modem-manager is the process you need to look at, as it holds the
> > BTM, probably for a long time. It should not do that. First you should
> > verify that when you don't have modem-manager running, the problem goes
> > away. This may indeed be related to the misdetection of a serial port.
> >   
> That test is built-in: When modem-manager stops poking around, 
> everything becomes fine again.

Ok.

> > The other thing to check is what modem-manager is actually doing. You can
> > find out by looking at /proc/<pid>/stack, where <pid> is the pid of the
> > modem-manager process, 1835 in your example.
> >   
> OK, so this is what I did eventually: I ran strace on sshd and 
> modem-manager, and I also added printk's to tell me when tty_lock() is 
> invoked, when it actually got the lock, and when tty_unlock() is 
> invoked. Pretty messy, and still.
> 
> Let's start with sshd's strace. We have a
> 01:06:20.152787 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
> device or address) <30.008153>

That looks unrelated, as far as I can tell.

> And this is what we have in /var/log/messages:
> 
> Nov  1 01:06:20 ocho modem-manager: (ttyS2) closing serial device...
> Nov  1 01:06:20 ocho kernel: tty_lock() called
> Nov  1 01:06:20 ocho kernel: tty_lock() got lock
> Nov  1 01:06:20 ocho kernel: tty_lock() called
> Nov  1 01:06:50 ocho kernel: tty_unlock() called
> Nov  1 01:06:50 ocho kernel: tty_lock() got lock
> Nov  1 01:06:50 ocho kernel: tty_unlock() called
> Nov  1 01:06:50 ocho kernel: tty_lock() called
> 
> So we can see that someone got the lock, and then didn't release it for 
> 30 seconds, and that someone else tried to get the lock and waited for 
> it 30 seconds.

Ok, that's certainly modem-manager in the tty_release function, calling
uart_close, which has this code:

        if (port->closing_wait != ASYNC_CLOSING_WAIT_NONE) {
                /*
                 * hack: open-coded tty_wait_until_sent to avoid
                 * recursive tty_lock
                 */
                long timeout = msecs_to_jiffies(port->closing_wait);
                if (wait_event_interruptible_timeout(tty->write_wait,
                                !tty_chars_in_buffer(tty), timeout) >= 0)
                        __uart_wait_until_sent(uport, timeout);
        }

port->closing_wait gets initialized to 30 seconds, which fits your
symptom.

I'm not sure whether it blocks in wait_event or in __uart_wait_until_sent
though, both get the same timeout.

> An strace of modem-manager running under vanilla kernel 2.6.35-4 shows 
> that close() takes 30 seconds there as well. It just so happens that it 
> went unnoticed (maybe because of the changes in the locking schema?).

Exactly. The code used to hold the BKL, which gets implicitly released
during a sleeping operation like wait_event_interruptible_timeout or 
__uart_wait_until_sent. With the change to a mutex, this does not
happen any more.

When I did the change, I did not anticipate the call actually timing
out the 30 seconds.

> I should only note that all TTY close() calls take 30 seconds, and that 
> all ports get a ioctl() and extra close() after the file descriptor is 
> closed, with an EBADF as response.

This must be an unrelated bug in modem-manager.

Can you confirm that this only happens on tty1 through tty4, but not on
tty0, which is your real uart?

I think we need to find out why it's hitting the timeout most of all,
and that means understanding what the hardware on your system is actually
doing or supposed to be doing. What does "lspci -vvx" show about the
port? Can you disable it in some way?

	Arnd

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-11-01 19:39         ` Arnd Bergmann
@ 2010-11-01 20:46           ` Alan Cox
  2010-11-03  0:15           ` Eli Billauer
  1 sibling, 0 replies; 14+ messages in thread
From: Alan Cox @ 2010-11-01 20:46 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Eli Billauer, Andrew Morton, linux-kernel, Greg KH

> I think we need to find out why it's hitting the timeout most of all,

Its a perfectly normal happening in some cases - eg when you the remote
end asserts flow control and then hangs up on you leaving it that way.

So we need to handle this by dropping locks correctly.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-10-31 11:36         ` Arnd Bergmann
@ 2010-11-02 10:28           ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2010-11-02 10:28 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: James Cloos, Eli Billauer, Andrew Morton, linux-kernel

Hello,

On 10/31/2010 12:36 PM, Arnd Bergmann wrote:
> On Sunday 31 October 2010, James Cloos wrote:
>> WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2()
>> Hardware name: MS-7642
>> Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp
>> Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18
>> Call Trace:
>>  [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d
>>  [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c
>>  [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2
>>  [<ffffffff81068453>] worker_thread+0x182/0x19b
>>  [<ffffffff810682d1>] ? worker_thread+0x0/0x19b
>>  [<ffffffff8106ba81>] kthread+0x82/0x8a
>>  [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
>>  [<ffffffff8106b9ff>] ? kthread+0x0/0x8a
>>  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
>> ---[ end trace 756b0818a6415dca ]---
>>
>> That was followed by a number of task blocked for more than 120 seconds
>> messages, all due to waiting for pty input or output.  An example trace:
> 
> This is triggered by an internal check in the workqueue handling that
> Tejun introduced. I guess it might be possible that this is triggered by
> a worker thread being blocked on a mutex, but I don't understand
> exactly what it is checking for. 

Hmmm... that's a condition which shouldn't happen regardless of being
blocked on mutex or whatever.  It means that internal book keeping
used to manage concurrency went out of sync somehow.  I've re-read the
code managing nr_running but couldn't spot where the problem could be.
Is there any way to reproduce the problem?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-11-01 19:39         ` Arnd Bergmann
  2010-11-01 20:46           ` Alan Cox
@ 2010-11-03  0:15           ` Eli Billauer
  2010-11-03  3:32             ` Arnd Bergmann
  1 sibling, 1 reply; 14+ messages in thread
From: Eli Billauer @ 2010-11-03  0:15 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Andrew Morton, linux-kernel, Greg KH, Alan Cox

Arnd Bergmann wrote:

> Ok, that's certainly modem-manager in the tty_release function, calling
> uart_close, which has this code:
>
>         if (port->closing_wait != ASYNC_CLOSING_WAIT_NONE) {
>                 /*
>                  * hack: open-coded tty_wait_until_sent to avoid
>                  * recursive tty_lock
>                  */
>                 long timeout = msecs_to_jiffies(port->closing_wait);
>                 if (wait_event_interruptible_timeout(tty->write_wait,
>                                 !tty_chars_in_buffer(tty), timeout) >= 0)
>                         __uart_wait_until_sent(uport, timeout);
>         }
>
> port->closing_wait gets initialized to 30 seconds, which fits your
> symptom.
>   
I lost you here. Maybe my version is outdated? The most similar thing I 
have is this in tty_port.c, in tty_port_close_start():

    if (test_bit(ASYNCB_INITIALIZED, &port->flags) &&
            port->closing_wait != ASYNC_CLOSING_WAIT_NONE)
        tty_wait_until_sent(tty, port->closing_wait);

I put a printk on invocations of tty_wait_until_sent(), which was called 
like crazy during bootup on behalf of tty1, and then never again after 
boot was completed. I should point out, that during this session, all I 
did was to wait a few minutes and then reboot the computer from the GUI 
login console. So I never logged in. Anyhow, the ttySx ports were opened 
and closed, the same 30 seconds delays, but no call to 
tty_wait_until_sent() until kernel logging was stopped.

> What does "lspci -vvx" show about the
> port? 
It turns out, that the device, to which ttyS1-ttyS3 are attached is a 
soft modem, which doesn't even have drivers for a 64 bits system. There 
is no /dev/ttyS4, which is consistent with the "Couldn't register serial 
port 0000:05:04.0: -28" message. Anyhow, I can't say I understand why 
any serial port was allocated to this modem. But it's not like I 
understand how it should work.

05:04.0 Modem: ALi Corporation SmartLink SmartPCI563 56K Modem (prog-if 
00 [Generic])
    Subsystem: Device 2020:545a
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
<TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 32
    Interrupt: pin A routed to IRQ 16
    Region 0: Memory at fbbff000 (32-bit, non-prefetchable) [size=4K]
    Region 1: I/O ports at be00 [size=256]
    Capabilities: [40] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Kernel driver in use: serial
00: b9 10 5a 54 07 00 90 02 00 00 03 07 00 20 00 00
10: 00 f0 bf fb 01 be 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 20 20 5a 54
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 00 00

   Eli

-- 
Web: http://www.billauer.co.il


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-11-03  0:15           ` Eli Billauer
@ 2010-11-03  3:32             ` Arnd Bergmann
  2010-11-03 10:34               ` Alan Cox
  0 siblings, 1 reply; 14+ messages in thread
From: Arnd Bergmann @ 2010-11-03  3:32 UTC (permalink / raw)
  To: Eli Billauer; +Cc: Andrew Morton, linux-kernel, Greg KH, Alan Cox

On Wednesday 03 November 2010, Eli Billauer wrote:
> Arnd Bergmann wrote:
> 
> > Ok, that's certainly modem-manager in the tty_release function, calling
> > uart_close, which has this code:
> >
> >         if (port->closing_wait != ASYNC_CLOSING_WAIT_NONE) {
> >                 /*
> >                  * hack: open-coded tty_wait_until_sent to avoid
> >                  * recursive tty_lock
> >                  */
> >                 long timeout = msecs_to_jiffies(port->closing_wait);
> >                 if (wait_event_interruptible_timeout(tty->write_wait,
> >                                 !tty_chars_in_buffer(tty), timeout) >= 0)
> >                         __uart_wait_until_sent(uport, timeout);
> >         }
> >
> > port->closing_wait gets initialized to 30 seconds, which fits your
> > symptom.
> >   
> I lost you here. Maybe my version is outdated? The most similar thing I 
> have is this in tty_port.c, in tty_port_close_start():
> 
>     if (test_bit(ASYNCB_INITIALIZED, &port->flags) &&
>             port->closing_wait != ASYNC_CLOSING_WAIT_NONE)
>         tty_wait_until_sent(tty, port->closing_wait);

The code I quoted is from uart_close() in drivers/serial/serial_core.c

> I put a printk on invocations of tty_wait_until_sent(), which was called 
> like crazy during bootup on behalf of tty1, and then never again after 
> boot was completed. I should point out, that during this session, all I 
> did was to wait a few minutes and then reboot the computer from the GUI 
> login console. So I never logged in. Anyhow, the ttySx ports were opened 
> and closed, the same 30 seconds delays, but no call to 
> tty_wait_until_sent() until kernel logging was stopped.

Right. The uart_close() function does not call tty_wait_until_sent,
it calls __uart_wait_until_sent directly.

> > What does "lspci -vvx" show about the
> > port? 
> It turns out, that the device, to which ttyS1-ttyS3 are attached is a 
> soft modem, which doesn't even have drivers for a 64 bits system. There 
> is no /dev/ttyS4, which is consistent with the "Couldn't register serial 
> port 0000:05:04.0: -28" message. Anyhow, I can't say I understand why 
> any serial port was allocated to this modem. But it's not like I 
> understand how it should work.
> 
> 05:04.0 Modem: ALi Corporation SmartLink SmartPCI563 56K Modem (prog-if 
> 00 [Generic])

It should actually be covered by the intel8x0m.c driver as far as I can tell,
but it's missing from the PCI IDs in that driver...

As Alan said, we need to sort out the locking though. One problem is that
we cannot just do tty_unlock/tty_lock around the sleeping call, because we
also hold &port->mutex there, which nests inside of tty_lock().

I hope Alan can figure out if it's either safe to drop both here, or if we
might be able to call uart_close without tty_lock() held in the first place.

	Arnd

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: open() on /dev/tty takes 30 seconds on 2.6.36
  2010-11-03  3:32             ` Arnd Bergmann
@ 2010-11-03 10:34               ` Alan Cox
  0 siblings, 0 replies; 14+ messages in thread
From: Alan Cox @ 2010-11-03 10:34 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Eli Billauer, Andrew Morton, linux-kernel, Greg KH

> I hope Alan can figure out if it's either safe to drop both here, or if we
> might be able to call uart_close without tty_lock() held in the first place.

That was always my intention and why I moved it to tty_port. I think it
is safe to do that, but as far as I can tell the port mutex is assumed
held by the low level drivers during the uart ops calls some of the time.

Safest is probably to drop the tty lock before we take the port mutex and
take it again when we exit.

The tty_port fields are protected by the port mutex/lock

The uport methods by the uport lock

The only two points of concern I see are updating of closing_wait as it
is read (no big deal), and the nasty - which is tty_ldisc_flush. I am not
sure what assumptions lurk in the ldisc flush paths but I think it's ok.

uart_wait_until_sent will also need to not take the tty lock at that
point to fix it properly.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-11-03 10:35 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-30  7:47 open() on /dev/tty takes 30 seconds on 2.6.36 Eli Billauer
2010-10-30 18:46 ` Andrew Morton
2010-10-30 20:46   ` Arnd Bergmann
2010-10-31  0:20   ` Eli Billauer
2010-10-31  3:36     ` Arnd Bergmann
2010-10-31  6:34       ` James Cloos
2010-10-31 11:36         ` Arnd Bergmann
2010-11-02 10:28           ` Tejun Heo
2010-11-01  1:12       ` Eli Billauer
2010-11-01 19:39         ` Arnd Bergmann
2010-11-01 20:46           ` Alan Cox
2010-11-03  0:15           ` Eli Billauer
2010-11-03  3:32             ` Arnd Bergmann
2010-11-03 10:34               ` Alan Cox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.