linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* tasks getting stuck on mmap_sem?
@ 2010-08-17  4:12 David Miller
  2010-08-17 16:14 ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2010-08-17  4:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: torvalds


I just started seeing this with current GIT, is it related
to the page table locking shuffle of a few days ago?

[21029.265646] INFO: task git:16643 blocked for more than 120 seconds.
[21029.267054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21029.268412] git           D 000000000072f804     0 16643  16273 0x210001101000080
[21029.269809] Call Trace:
[21029.271062]  [000000000072f778] rwsem_down_failed_common+0x90/0xb8
[21029.272390]  [000000000072f804] __down_read+0x30/0x40
[21029.273689]  [0000000000731148] do_sparc64_fault+0x238/0x788
[21029.274996]  [00000000004079e8] sparc64_realfault_common+0x10/0x20
[21029.276347] INFO: task git:16644 blocked for more than 120 seconds.
[21029.277655] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21029.278993] git           D 000000000072f874     0 16644  16273 0x210041101000080
[21029.280348] Call Trace:
[21029.281542]  [000000000072f778] rwsem_down_failed_common+0x90/0xb8
[21029.282818]  [000000000072f874] __down_write+0x34/0x44
[21029.284057]  [00000000004d9c44] SyS_mmap_pgoff+0xc8/0x118
[21029.285293]  [0000000000406154] linux_sparc_syscall32+0x34/0x40
 ...

etc.  it's a set of git threads computing deltas for a remote pull
request to the machine.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17  4:12 tasks getting stuck on mmap_sem? David Miller
@ 2010-08-17 16:14 ` Linus Torvalds
  2010-08-17 18:24   ` Tony Luck
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2010-08-17 16:14 UTC (permalink / raw)
  To: David Miller, Michel Lespinasse, David Howells, Andrew Morton
  Cc: linux-kernel

On Mon, Aug 16, 2010 at 9:12 PM, David Miller <davem@davemloft.net> wrote:
>
> I just started seeing this with current GIT, is it related
> to the page table locking shuffle of a few days ago?

No. Looks like the rwsem changes broke sparc too. ia64 had some problems too.

ia64 had the type problems, but even after they were fixed there was
reports of something similar. So I am starting to suspect that there
is something more serious going on here - possibly hidden on x86 due
to the strong memory ordering. And possibly just because mmap_sem is
_mostly_ just used by a single thread. Threaded apps are still
uncommon.

Adding the guilty parties to the cc, leaving your backtrace quoted.

                       Linus

---
> [21029.265646] INFO: task git:16643 blocked for more than 120 seconds.
> [21029.267054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [21029.268412] git           D 000000000072f804     0 16643  16273 0x210001101000080
> [21029.269809] Call Trace:
> [21029.271062]  [000000000072f778] rwsem_down_failed_common+0x90/0xb8
> [21029.272390]  [000000000072f804] __down_read+0x30/0x40
> [21029.273689]  [0000000000731148] do_sparc64_fault+0x238/0x788
> [21029.274996]  [00000000004079e8] sparc64_realfault_common+0x10/0x20
> [21029.276347] INFO: task git:16644 blocked for more than 120 seconds.
> [21029.277655] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [21029.278993] git           D 000000000072f874     0 16644  16273 0x210041101000080
> [21029.280348] Call Trace:
> [21029.281542]  [000000000072f778] rwsem_down_failed_common+0x90/0xb8
> [21029.282818]  [000000000072f874] __down_write+0x34/0x44
> [21029.284057]  [00000000004d9c44] SyS_mmap_pgoff+0xc8/0x118
> [21029.285293]  [0000000000406154] linux_sparc_syscall32+0x34/0x40
>  ...
>
> etc.  it's a set of git threads computing deltas for a remote pull
> request to the machine.
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 16:14 ` Linus Torvalds
@ 2010-08-17 18:24   ` Tony Luck
  2010-08-17 20:46     ` Tony Luck
  2010-08-17 21:44     ` David Miller
  0 siblings, 2 replies; 14+ messages in thread
From: Tony Luck @ 2010-08-17 18:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Michel Lespinasse, David Howells, Andrew Morton,
	linux-kernel

On Tue, Aug 17, 2010 at 9:14 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> No. Looks like the rwsem changes broke sparc too. ia64 had some problems too.

I did have some similar mmap_sem issues - but the combination of
fixing the types
of the RWSEM_* defines to be unsigned, and the return value of
ia64_atomic64_add()
to be "long" rather than "int" looks to have cleared up the problems I
was seeing. I
could generally see the hung processes in less than 10 consecutive
kernel builds, but
ran a few thousand builds over the weekend with no issues.

If git is multi-threaded, it may be hitting some different code path
... but it isn't trivial
for me to try this out (my systems are on an isolated lab network segment).

-Tony

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 18:24   ` Tony Luck
@ 2010-08-17 20:46     ` Tony Luck
  2010-08-17 21:08       ` Linus Torvalds
  2010-08-17 21:44     ` David Miller
  1 sibling, 1 reply; 14+ messages in thread
From: Tony Luck @ 2010-08-17 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Michel Lespinasse, David Howells, Andrew Morton,
	linux-kernel

On Tue, Aug 17, 2010 at 11:24 AM, Tony Luck <tony.luck@intel.com> wrote:
> ... but it isn't trivial
> for me to try this out (my systems are on an isolated lab network segment).

Setting up for "git pull" would have been messy - but "git grep" is
multi-threaded in newer versions of git. So I've set a loop to keep
running "git grep" over and over.  Load average is above 7, so
there must be some threading going on, even though I only see
one "git" process in "top(1)".  Seems to be running fine so far.

-Tony

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 20:46     ` Tony Luck
@ 2010-08-17 21:08       ` Linus Torvalds
  2010-08-17 21:26         ` Tony Luck
  2010-08-17 21:28         ` Linus Torvalds
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2010-08-17 21:08 UTC (permalink / raw)
  To: Tony Luck
  Cc: David Miller, Michel Lespinasse, David Howells, Andrew Morton,
	linux-kernel

On Tue, Aug 17, 2010 at 1:46 PM, Tony Luck <tony.luck@intel.com> wrote:
>
> Setting up for "git pull" would have been messy - but "git grep" is
> multi-threaded in newer versions of git. So I've set a loop to keep
> running "git grep" over and over.  Load average is above 7, so
> there must be some threading going on, even though I only see
> one "git" process in "top(1)".  Seems to be running fine so far.

I'm not sure how page-fault intensive that 'git grep' thing is - I
think it does mainly open/read/close on the files and probably doesn't
take that many faults once malloc() has created the main heap. In
contrast, the delta compute code (that david hit) is probably going to
do a lot more page faults due to being way more data-intensive (many
more mmap/munmap calls - it's how git tends to like to map git
objects).

So a more fault-intensive - and still threaded and all local - git
load is likely to do something like

   git repack -adf

and do it on something smaller than the whole kernel (git itself is
probably a good test-case, and probably isn't that far off from a
medium-sized kernel "pull" request).

                 Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 21:08       ` Linus Torvalds
@ 2010-08-17 21:26         ` Tony Luck
  2010-08-17 21:28         ` Linus Torvalds
  1 sibling, 0 replies; 14+ messages in thread
From: Tony Luck @ 2010-08-17 21:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Michel Lespinasse, David Howells, Andrew Morton,
	linux-kernel

On Tue, Aug 17, 2010 at 2:08 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>   git repack -adf

$ /usr/bin/time git repack -adf
Counting objects: 125946, done.
Delta compression using up to 32 threads.
Compressing objects: 100% (124132/124132), done.
Writing objects: 100% (125946/125946), done.
Total 125946 (delta 92181), reused 0 (delta 0)
75.80user 5.83system 0:28.59elapsed 285%CPU (0avgtext+0avgdata
12414976maxresident)k
0inputs+0outputs (9major+17615minor)pagefaults 0swaps

Some threading, some page faults, no processes stuck on rwsem things.

Left that command in a while loop ... 10 repeats so far.

-Tony

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 21:08       ` Linus Torvalds
  2010-08-17 21:26         ` Tony Luck
@ 2010-08-17 21:28         ` Linus Torvalds
  2010-08-17 21:47           ` Michel Lespinasse
  2010-08-17 23:40           ` David Miller
  1 sibling, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2010-08-17 21:28 UTC (permalink / raw)
  To: Tony Luck
  Cc: David Miller, Michel Lespinasse, David Howells, Andrew Morton,
	linux-kernel

Oh, and it strikes me when looking at the sparc rwsem-const.h that it
has some unsigned constants in it. I think

  #define RWSEM_WAITING_BIAS              0xffff0000

is actually unsigned, because it's a hex constant that doesn't fit in
a signed int. So it's of type 'unsigned int' if I recall the C type
rules right.

(That's one of the really _subtle_ C type rules: hex constants
actually act differently from decimal constants).

So now RWSEM_ACTIVE_WRITE_BIAS is unsigned too.

The x86 constants are all explicitly signed.

Btw, Michel sent out an email saying:

  FYI, I just verified that RWSEM_WAITING_BIAS is defined as signed on
  all architectures except for ia64. So, this would be consistent with
  the issue being observed only on ia64.

but I think he missed that subtlety of the C type system. Other architectures do

  #define RWSEM_WAITING_BIAS             (-0x00010000)

which is indeed signed, but Sparc is unsigned despite the lack of a "u".

Of course, maybe I only _think_ I know all the C typing rules. I might
be full of sh*t. It happens..I didn't actually look them up.

             Linus


                  Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 18:24   ` Tony Luck
  2010-08-17 20:46     ` Tony Luck
@ 2010-08-17 21:44     ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2010-08-17 21:44 UTC (permalink / raw)
  To: tony.luck; +Cc: torvalds, walken, dhowells, akpm, linux-kernel

From: Tony Luck <tony.luck@intel.com>
Date: Tue, 17 Aug 2010 11:24:19 -0700

> On Tue, Aug 17, 2010 at 9:14 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> No. Looks like the rwsem changes broke sparc too. ia64 had some problems too.
> 
> I did have some similar mmap_sem issues - but the combination of
> fixing the types
> of the RWSEM_* defines to be unsigned, and the return value of
> ia64_atomic64_add()
> to be "long" rather than "int" looks to have cleared up the problems I
> was seeing. I
> could generally see the hung processes in less than 10 consecutive
> kernel builds, but
> ran a few thousand builds over the weekend with no issues.

You might be triggering it via threading as well, since make uses
vfork() for running sub-jobs.

> If git is multi-threaded, it may be hitting some different code path
> ... but it isn't trivial
> for me to try this out (my systems are on an isolated lab network segment).

Like you I tried fixing atomic64, but as I mentioned it turns out
sparc64 (like powerpc) always uses a 32-bit 'int' semaphore count.

I thought perhaps that for some reason the rwsem generic code had
a dependency on 'long' so I switched sparc64's rwsems over to 'long'
counters last night too, but I still get the problem.

Even reverting the rwsem commit that added the "<" test didn't fix
things, so now I'm simply going to bisect.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 21:28         ` Linus Torvalds
@ 2010-08-17 21:47           ` Michel Lespinasse
  2010-08-17 21:59             ` David Miller
  2010-08-17 23:40           ` David Miller
  1 sibling, 1 reply; 14+ messages in thread
From: Michel Lespinasse @ 2010-08-17 21:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tony Luck, David Miller, David Howells, Andrew Morton, linux-kernel

On Tue, Aug 17, 2010 at 2:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Oh, and it strikes me when looking at the sparc rwsem-const.h that it
> has some unsigned constants in it. I think
>
>  #define RWSEM_WAITING_BIAS              0xffff0000
>
> is actually unsigned, because it's a hex constant that doesn't fit in
> a signed int. So it's of type 'unsigned int' if I recall the C type
> rules right.

Eeeek, this totally escaped me when I had a quick look at constant
signed-ness across architectures. This is definitely a problem.

> #define RWSEM_WAITING_BIAS             (-0x00010000)

David, do you still see failures with v2.6.36-rc1 plus that one line fix ?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 21:47           ` Michel Lespinasse
@ 2010-08-17 21:59             ` David Miller
  0 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2010-08-17 21:59 UTC (permalink / raw)
  To: walken; +Cc: torvalds, tony.luck, dhowells, akpm, linux-kernel

From: Michel Lespinasse <walken@google.com>
Date: Tue, 17 Aug 2010 14:47:05 -0700

> On Tue, Aug 17, 2010 at 2:28 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Oh, and it strikes me when looking at the sparc rwsem-const.h that it
>> has some unsigned constants in it. I think
>>
>>  #define RWSEM_WAITING_BIAS              0xffff0000
>>
>> is actually unsigned, because it's a hex constant that doesn't fit in
>> a signed int. So it's of type 'unsigned int' if I recall the C type
>> rules right.
> 
> Eeeek, this totally escaped me when I had a quick look at constant
> signed-ness across architectures. This is definitely a problem.
> 
>> #define RWSEM_WAITING_BIAS             (-0x00010000)
> 
> David, do you still see failures with v2.6.36-rc1 plus that one line fix ?

As I said in a reply to Tony I don't even think your commit is
to blame, because even after reverting it I still have the
problems.

I'll keep digging into this.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17 21:28         ` Linus Torvalds
  2010-08-17 21:47           ` Michel Lespinasse
@ 2010-08-17 23:40           ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2010-08-17 23:40 UTC (permalink / raw)
  To: torvalds; +Cc: tony.luck, walken, dhowells, akpm, linux-kernel

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 17 Aug 2010 14:28:49 -0700

> Oh, and it strikes me when looking at the sparc rwsem-const.h that it
> has some unsigned constants in it. I think
> 
>   #define RWSEM_WAITING_BIAS              0xffff0000
> 
> is actually unsigned, because it's a hex constant that doesn't fit in
> a signed int. So it's of type 'unsigned int' if I recall the C type
> rules right.

You were right all along, of course.  Nice spotting.

Something was messed up with my testing methodology last night such
that I didn't test the revert of the lib/rwsem.c changes properly.
Those actually did in fact introduce the problem.

And fixing these sparc rwsem values to be signed constants fixes the bug.

Thanks guys.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17  5:22 Dimitris Michailidis
  2010-08-17  5:35 ` David Miller
@ 2010-08-17  7:23 ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2010-08-17  7:23 UTC (permalink / raw)
  To: dm; +Cc: linux-kernel

From: Dimitris Michailidis <dm@chelsio.com>
Date: Mon, 16 Aug 2010 22:22:04 -0700

> This may be similar to the problems IA-64 had last week as a result of
> the rwsem changes, eg see commit
> 01d69a82e1d3c938da16bf55aab000672243aa24.  I see
> arch/sparc/include/asm/atomic_64.h uses ints with atomic64_t, which
> was a problem for IA-64.  Just guessing.

Well, the return types for the atomic64 routines were wrong and I
fixed that.

But our rwsem implementation on sparc64 uses a 32-bit counter so that
couldn't be the issue.  And indeed I still get the problem of tasks
being stuck on rwsems even after the atomic64 return types are fixed.

But I bet that rwsem commit 424acaaeb3a3932d64a9b4bd59df6cf72c22d8f3
introduced some kind of problem in another manner.  I'll try to dig
some more.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
  2010-08-17  5:22 Dimitris Michailidis
@ 2010-08-17  5:35 ` David Miller
  2010-08-17  7:23 ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2010-08-17  5:35 UTC (permalink / raw)
  To: dm; +Cc: linux-kernel

From: Dimitris Michailidis <dm@chelsio.com>
Date: Mon, 16 Aug 2010 22:22:04 -0700

> This may be similar to the problems IA-64 had last week as a result of
> the rwsem changes, eg see commit
> 01d69a82e1d3c938da16bf55aab000672243aa24.  I see
> arch/sparc/include/asm/atomic_64.h uses ints with atomic64_t, which
> was a problem for IA-64.  Just guessing.

Thanks Dimitris I'll take a look.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tasks getting stuck on mmap_sem?
@ 2010-08-17  5:22 Dimitris Michailidis
  2010-08-17  5:35 ` David Miller
  2010-08-17  7:23 ` David Miller
  0 siblings, 2 replies; 14+ messages in thread
From: Dimitris Michailidis @ 2010-08-17  5:22 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel

David Miller wrote:
> I just started seeing this with current GIT, is it related
> to the page table locking shuffle of a few days ago?
> 
> [21029.265646] INFO: task git:16643 blocked for more than 120 seconds.
> [21029.267054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [21029.268412] git           D 000000000072f804     0 16643  16273 0x210001101000080
> [21029.269809] Call Trace:
> [21029.271062]  [000000000072f778] rwsem_down_failed_common+0x90/0xb8
> [21029.272390]  [000000000072f804] __down_read+0x30/0x40
> [21029.273689]  [0000000000731148] do_sparc64_fault+0x238/0x788
> [21029.274996]  [00000000004079e8] sparc64_realfault_common+0x10/0x20
> [21029.276347] INFO: task git:16644 blocked for more than 120 seconds.
> [21029.277655] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [21029.278993] git           D 000000000072f874     0 16644  16273 0x210041101000080
> [21029.280348] Call Trace:
> [21029.281542]  [000000000072f778] rwsem_down_failed_common+0x90/0xb8
> [21029.282818]  [000000000072f874] __down_write+0x34/0x44
> [21029.284057]  [00000000004d9c44] SyS_mmap_pgoff+0xc8/0x118
> [21029.285293]  [0000000000406154] linux_sparc_syscall32+0x34/0x40
>  ...
> 
> etc.  it's a set of git threads computing deltas for a remote pull
> request to the machine.

This may be similar to the problems IA-64 had last week as a result of the 
rwsem changes, eg see commit 01d69a82e1d3c938da16bf55aab000672243aa24.  I 
see arch/sparc/include/asm/atomic_64.h uses ints with atomic64_t, which was 
a problem for IA-64.  Just guessing.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-08-17 23:39 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-17  4:12 tasks getting stuck on mmap_sem? David Miller
2010-08-17 16:14 ` Linus Torvalds
2010-08-17 18:24   ` Tony Luck
2010-08-17 20:46     ` Tony Luck
2010-08-17 21:08       ` Linus Torvalds
2010-08-17 21:26         ` Tony Luck
2010-08-17 21:28         ` Linus Torvalds
2010-08-17 21:47           ` Michel Lespinasse
2010-08-17 21:59             ` David Miller
2010-08-17 23:40           ` David Miller
2010-08-17 21:44     ` David Miller
2010-08-17  5:22 Dimitris Michailidis
2010-08-17  5:35 ` David Miller
2010-08-17  7:23 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).