* + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
From: akpm @ 2014-07-02 19:25 UTC
  To: hughd, davej, koct9i, lczerner, sasha.levin, stable, vbabka, mm-commits


The patch titled
     Subject: shmem: fix faulting into a hole while it's punched, take 2
has been added to the -mm tree.  Its filename is
     shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Hugh Dickins <hughd@google.com>
Subject: shmem: fix faulting into a hole while it's punched, take 2

Trinity finds that mmap access to a hole while it's punched from shmem can
prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE) from
completing, until the (killable) reader stops; with the puncher's hold on
i_mutex locking out all other writers until it can complete.  This issue
was tagged with CVE-2014-4171.
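
The racing pattern can be sketched in userspace (a minimal illustration,
not the actual Trinity test case; the tmpfs path, sizes and the omitted
error handling are placeholders):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN (64 << 20)

	static volatile char *map;

	static void *faulter(void *arg)
	{
		long i;

		(void)arg;
		for (;;)	/* keep re-instantiating pages in the hole */
			for (i = 0; i < LEN; i += 4096)
				map[i] = 1;
		return NULL;
	}

	int main(void)
	{
		int fd = open("/dev/shm/punchme", O_RDWR | O_CREAT, 0600);
		pthread_t t;

		ftruncate(fd, LEN);
		map = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		pthread_create(&t, NULL, faulter, NULL);
		for (;;)
			/* on an affected kernel, each punch can stall for
			 * as long as the faulter keeps touching the range */
			fallocate(fd, FALLOC_FL_PUNCH_HOLE |
				      FALLOC_FL_KEEP_SIZE, 0, LEN);
	}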

It appears that the tmpfs fault path is too light in comparison with its
hole-punching path, lacking an i_data_sem to obstruct it; but we don't
want to slow down the common case.  It is not a problem in truncation,
because there the SIGBUS beyond i_size stops pages from being appended.

The origin of this problem is my v3.1 commit d0823576bf4b ("mm: pincer in
truncate_inode_pages_range"), once it was duplicated into shmem.c.  It
seemed like a nice idea at the time, to ensure (barring RCU lookup
fuzziness) that there's an instant when the entire hole is empty; but the
indefinitely repeated scans to ensure that make it vulnerable.
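
Sketched from the lines the diff below removes, the hole-punch loop
before this patch looked roughly like (paraphrased, not exact source):

	index = start;
	for ( ; ; ) {
		cond_resched();
		pvec.nr = find_get_entries(mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				pvec.pages, indices);
		if (!pvec.nr) {
			/* a full pass over [start, end) found nothing */
			if (index == start || unfalloc)
				break;
			/* otherwise rescan the whole range from the top */
			index = start;
			continue;
		}
		/* lock, truncate and release whatever was found,
		 * advancing index past it */
		...
	}

So the only exit for hole-punch was a complete pass that found the range
empty, and a concurrent faulter re-instantiating pages can put that off
indefinitely.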

Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple of
comments there.

Remove the "indices[0] >= end" test: that is now handled satisfactorily by
the inner loop, and mem_cgroup_uncharge_start()/end() are too light to be
worth avoiding here.

But if we do not always loop indefinitely, we do need to handle the case
of swap swizzled back to page before shmem_free_swap() gets it: add a
retry for that case, as suggested by Konstantin Khlebnikov.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-and-Tested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>	[3.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff -puN mm/shmem.c~shmem-fix-faulting-into-a-hole-while-its-punched-take-2 mm/shmem.c
--- a/mm/shmem.c~shmem-fix-faulting-into-a-hole-while-its-punched-take-2
+++ a/mm/shmem.c
@@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 
 		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
 		if (!pvec.nr) {
-			if (index == start || unfalloc)
+			/* If all gone or hole-punch or unfalloc, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if ((index == start || unfalloc) && indices[0] >= end) {
-			pagevec_remove_exceptionals(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
 			if (radix_tree_exceptional_entry(page)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
-								index, page);
+				if (shmem_free_swap(mapping, index, page)) {
+					/* Swap was replaced by page: retry */
+					index--;
+					break;
+				}
+				nr_swaps_freed++;
 				continue;
 			}
 
_

Patches currently in -mm which might be from hughd@google.com are

shmem-fix-init_page_accessed-use-to-stop-pagelru-bug.patch
revert-shmem-fix-faulting-into-a-hole-while-its-punched.patch
shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
mm-fs-fix-pessimization-in-hole-punching-pagecache.patch
mm-memoryc-use-entry-=-access_oncepte-in-handle_pte_fault.patch
mm-memcontrol-fold-mem_cgroup_do_charge.patch
mm-memcontrol-rearrange-charging-fast-path.patch
mm-memcontrol-reclaim-at-least-once-for-__gfp_noretry.patch
mm-huge_memory-use-gfp_transhuge-when-charging-huge-pages.patch
mm-memcontrol-retry-reclaim-for-oom-disabled-and-__gfp_nofail-charges.patch
mm-memcontrol-remove-explicit-oom-parameter-in-charge-path.patch
mm-memcontrol-simplify-move-precharge-function.patch
mm-memcontrol-catch-root-bypass-in-move-precharge.patch
mm-memcontrol-use-root_mem_cgroup-res_counter.patch
mm-memcontrol-remove-ordering-between-pc-mem_cgroup-and-pagecgroupused.patch
mm-memcontrol-do-not-acquire-page_cgroup-lock-for-kmem-pages.patch
mm-memcontrol-rewrite-charge-api.patch
mm-memcontrol-rewrite-uncharge-api.patch
mm-memcontrol-rewrite-uncharge-api-fix-5.patch
mm-memcontrol-rewrite-charge-api-fix-shmem_unuse.patch
mm-memcontrol-rewrite-charge-api-fix-shmem_unuse-fix.patch
mm-vmallocc-add-a-schedule-point-to-vmalloc.patch
mm-vmallocc-add-a-schedule-point-to-vmalloc-fix.patch
include-linux-mmdebugh-add-vm_warn_once.patch
shmem-fix-double-uncharge-in-__shmem_file_setup.patch
shmem-update-memory-reservation-on-truncate.patch
mm-catch-memory-commitment-underflow.patch
mm-catch-memory-commitment-underflow-fix.patch
mm-export-nr_shmem-via-sysinfo2-si_meminfo-interfaces.patch
mm-replace-init_page_accessed-by-__setpagereferenced.patch
mm-zbud-zbud_alloc-minor-param-change.patch
mm-zbud-change-zbud_alloc-size-type-to-size_t.patch
mm-zpool-implement-common-zpool-api-to-zbud-zsmalloc.patch
mm-zpool-zbud-zsmalloc-implement-zpool.patch
mm-zpool-update-zswap-to-use-zpool.patch
mm-zpool-prevent-zbud-zsmalloc-from-unloading-when-used.patch
list-use-argument-hlist_add_after-names-from-rcu-variant.patch
list-fix-order-of-arguments-for-hlist_add_after_rcu.patch
klist-use-same-naming-scheme-as-hlist-for-klist_add_after.patch
mm-replace-remap_file_pages-syscall-with-emulation-fix-3.patch


* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
From: Sasha Levin @ 2014-07-09  4:03 UTC
  To: akpm, hughd, davej, koct9i, lczerner, stable, Vlastimil Babka
  Cc: linux-mm, LKML

On 07/02/2014 03:25 PM, akpm@linux-foundation.org wrote:
> 
> The patch titled
>      Subject: shmem: fix faulting into a hole while it's punched, take 2
> has been added to the -mm tree.  Its filename is
>      shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
> 
> This patch should soon appear at
>     http://ozlabs.org/~akpm/mmots/broken-out/shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
> and later at
>     http://ozlabs.org/~akpm/mmotm/broken-out/shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
> 
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
> 
> ------------------------------------------------------
> From: Hugh Dickins <hughd@google.com>
> Subject: shmem: fix faulting into a hole while it's punched, take 2
> 
> Trinity finds that mmap access to a hole while it's punched from shmem can
> prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE) from
> completing, until the (killable) reader stops; with the puncher's hold on
> i_mutex locking out all other writers until it can complete.  This issue
> was tagged with CVE-2014-4171.
> 
> It appears that the tmpfs fault path is too light in comparison with its
> hole-punching path, lacking an i_data_sem to obstruct it; but we don't
> want to slow down the common case.  It is not a problem in truncation,
> because there the SIGBUS beyond i_size stops pages from being appended.
> 
> The origin of this problem is my v3.1 commit d0823576bf4b ("mm: pincer in
> truncate_inode_pages_range"), once it was duplicated into shmem.c.  It
> seemed like a nice idea at the time, to ensure (barring RCU lookup
> fuzziness) that there's an instant when the entire hole is empty; but the
> indefinitely repeated scans to ensure that make it vulnerable.
> 
> Revert that "enhancement" to hole-punch from shmem_undo_range(), but
> retain the unproblematic rescanning when it's truncating; add a couple of
> comments there.
> 
> Remove the "indices[0] >= end" test: that is now handled satisfactorily by
> the inner loop, and mem_cgroup_uncharge_start()/end() are too light to be
> worth avoiding here.
> 
> But if we do not always loop indefinitely, we do need to handle the case
> of swap swizzled back to page before shmem_free_swap() gets it: add a
> retry for that case, as suggested by Konstantin Khlebnikov.
> 
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Suggested-and-Tested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Lukas Czerner <lczerner@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: <stable@vger.kernel.org>	[3.1+]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

I suspect there's something off with this patch, as the shmem_fallocate
hangs are back... Pretty much same as before:

[  363.600969] INFO: task trinity-c327:9203 blocked for more than 120 seconds.
[  363.605359]       Not tainted 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty #772
[  363.609730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.615861] trinity-c327    D 000000000000000b 13496  9203   8559 0x10000004
[  363.620284]  ffff8800b857bce8 0000000000000002 ffffffff9dc11b10 0000000000000001
[  363.624468]  ffff880104860000 ffff8800b857bfd8 00000000001d7740 00000000001d7740
[  363.629118]  ffff880104863000 ffff880104860000 ffff8800b857bcd8 ffff8801eaed8868
[  363.633879] Call Trace:
[  363.635442]  [<ffffffff9a4dc535>] schedule+0x65/0x70
[  363.638638]  [<ffffffff9a4dc948>] schedule_preempt_disabled+0x18/0x30
[  363.642833]  [<ffffffff9a4df0a5>] mutex_lock_nested+0x2e5/0x550
[  363.646599]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
[  363.651319]  [<ffffffff9719b721>] ? get_parent_ip+0x11/0x50
[  363.654683]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
[  363.658264]  [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
[  363.662010]  [<ffffffff971bd96e>] ? put_lock_stats.isra.12+0xe/0x30
[  363.665866]  [<ffffffff9730c043>] do_fallocate+0x153/0x1d0
[  363.669381]  [<ffffffff972b472f>] SyS_madvise+0x33f/0x970
[  363.672906]  [<ffffffff9a4e3f13>] tracesys+0xe1/0xe6
[  363.682900] 2 locks held by trinity-c327/9203:
[  363.684928]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff9730c02d>] do_fallocate+0x13d/0x1d0
[  363.715102]  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350


Thanks,
Sasha


* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
From: Hugh Dickins @ 2014-07-09  6:35 UTC
  To: Sasha Levin
  Cc: akpm, hughd, davej, koct9i, lczerner, stable, Vlastimil Babka,
	linux-mm, LKML

On Wed, 9 Jul 2014, Sasha Levin wrote:
> On 07/02/2014 03:25 PM, akpm@linux-foundation.org wrote:
> > From: Hugh Dickins <hughd@google.com>
> > Subject: shmem: fix faulting into a hole while it's punched, take 2
> 
> I suspect there's something off with this patch, as the shmem_fallocate
> hangs are back... Pretty much same as before:

Thank you for reporting, but that is depressing news.

I don't see what's wrong with this (take 2) patch,
and I don't see that it's been garbled in any way in next-20140708.

> 
> [  363.600969] INFO: task trinity-c327:9203 blocked for more than 120 seconds.
> [  363.605359]       Not tainted 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty #772
> [  363.609730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.615861] trinity-c327    D 000000000000000b 13496  9203   8559 0x10000004
> [  363.620284]  ffff8800b857bce8 0000000000000002 ffffffff9dc11b10 0000000000000001
> [  363.624468]  ffff880104860000 ffff8800b857bfd8 00000000001d7740 00000000001d7740
> [  363.629118]  ffff880104863000 ffff880104860000 ffff8800b857bcd8 ffff8801eaed8868
> [  363.633879] Call Trace:
> [  363.635442]  [<ffffffff9a4dc535>] schedule+0x65/0x70
> [  363.638638]  [<ffffffff9a4dc948>] schedule_preempt_disabled+0x18/0x30
> [  363.642833]  [<ffffffff9a4df0a5>] mutex_lock_nested+0x2e5/0x550
> [  363.646599]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
> [  363.651319]  [<ffffffff9719b721>] ? get_parent_ip+0x11/0x50
> [  363.654683]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
> [  363.658264]  [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350

So it's trying to acquire i_mutex at shmem_fallocate+0x6c...

> [  363.662010]  [<ffffffff971bd96e>] ? put_lock_stats.isra.12+0xe/0x30
> [  363.665866]  [<ffffffff9730c043>] do_fallocate+0x153/0x1d0
> [  363.669381]  [<ffffffff972b472f>] SyS_madvise+0x33f/0x970
> [  363.672906]  [<ffffffff9a4e3f13>] tracesys+0xe1/0xe6
> [  363.682900] 2 locks held by trinity-c327/9203:
> [  363.684928]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff9730c02d>] do_fallocate+0x13d/0x1d0
> [  363.715102]  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350

...but it already holds i_mutex, acquired at shmem_fallocate+0x6c.
Am I reading that correctly?

In my source for next-20140708, the only return from shmem_fallocate()
which omits to mutex_unlock(&inode->i_mutex) is the "return -EOPNOTSUPP"
at the top, just before the mutex_lock(&inode->i_mutex).  And inode
doesn't get reassigned in the middle.

Does 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty look different?

Hugh


* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
From: Vlastimil Babka @ 2014-07-09  9:50 UTC
  To: Hugh Dickins, Sasha Levin
  Cc: akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On 07/09/2014 08:35 AM, Hugh Dickins wrote:
> On Wed, 9 Jul 2014, Sasha Levin wrote:
>> On 07/02/2014 03:25 PM, akpm@linux-foundation.org wrote:
>>> From: Hugh Dickins <hughd@google.com>
>>> Subject: shmem: fix faulting into a hole while it's punched, take 2
>>
>> I suspect there's something off with this patch, as the shmem_fallocate
>> hangs are back... Pretty much same as before:
>
> Thank you for reporting, but that is depressing news.
>
> I don't see what's wrong with this (take 2) patch,
> and I don't see that it's been garbled in any way in next-20140708.
>
>>
>> [  363.600969] INFO: task trinity-c327:9203 blocked for more than 120 seconds.
>> [  363.605359]       Not tainted 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty #772
>> [  363.609730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  363.615861] trinity-c327    D 000000000000000b 13496  9203   8559 0x10000004
>> [  363.620284]  ffff8800b857bce8 0000000000000002 ffffffff9dc11b10 0000000000000001
>> [  363.624468]  ffff880104860000 ffff8800b857bfd8 00000000001d7740 00000000001d7740
>> [  363.629118]  ffff880104863000 ffff880104860000 ffff8800b857bcd8 ffff8801eaed8868
>> [  363.633879] Call Trace:
>> [  363.635442]  [<ffffffff9a4dc535>] schedule+0x65/0x70
>> [  363.638638]  [<ffffffff9a4dc948>] schedule_preempt_disabled+0x18/0x30
>> [  363.642833]  [<ffffffff9a4df0a5>] mutex_lock_nested+0x2e5/0x550
>> [  363.646599]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>> [  363.651319]  [<ffffffff9719b721>] ? get_parent_ip+0x11/0x50
>> [  363.654683]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>> [  363.658264]  [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>
> So it's trying to acquire i_mutex at shmem_fallocate+0x6c...
>
>> [  363.662010]  [<ffffffff971bd96e>] ? put_lock_stats.isra.12+0xe/0x30
>> [  363.665866]  [<ffffffff9730c043>] do_fallocate+0x153/0x1d0
>> [  363.669381]  [<ffffffff972b472f>] SyS_madvise+0x33f/0x970
>> [  363.672906]  [<ffffffff9a4e3f13>] tracesys+0xe1/0xe6
>> [  363.682900] 2 locks held by trinity-c327/9203:
>> [  363.684928]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff9730c02d>] do_fallocate+0x13d/0x1d0
>> [  363.715102]  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>
> ...but it already holds i_mutex, acquired at shmem_fallocate+0x6c.
> Am I reading that correctly?

I wonder why lockdep wouldn't fire here if it were a double lock? I
assume lockdep is enabled. It seems to me that lock #1 is being
printed because it's being acquired, not because it is already held:
__mutex_lock_common() calls mutex_acquire_nest() *before* it actually
tries to acquire the mutex. So the output is just confusing.

So it would again help to see stacks of other tasks, to see who holds 
the i_mutex and where it's stuck...

Vlastimil

> In my source for next-20140708, the only return from shmem_fallocate()
> which omits to mutex_unlock(&inode->i_mutex) is the "return -EOPNOTSUPP"
> at the top, just before the mutex_lock(&inode->i_mutex).  And inode
> doesn't get reassigned in the middle.
>
> Does 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty look different?
>
> Hugh
>


* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
From: Sasha Levin @ 2014-07-09 12:47 UTC
  To: Vlastimil Babka, Hugh Dickins
  Cc: akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On 07/09/2014 05:50 AM, Vlastimil Babka wrote:
> On 07/09/2014 08:35 AM, Hugh Dickins wrote:
>> On Wed, 9 Jul 2014, Sasha Levin wrote:
>>> [  363.600969] INFO: task trinity-c327:9203 blocked for more than 120 seconds.
>>> [  363.605359]       Not tainted 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty #772
>>> [  363.609730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  363.615861] trinity-c327    D 000000000000000b 13496  9203   8559 0x10000004
>>> [  363.620284]  ffff8800b857bce8 0000000000000002 ffffffff9dc11b10 0000000000000001
>>> [  363.624468]  ffff880104860000 ffff8800b857bfd8 00000000001d7740 00000000001d7740
>>> [  363.629118]  ffff880104863000 ffff880104860000 ffff8800b857bcd8 ffff8801eaed8868
>>> [  363.633879] Call Trace:
>>> [  363.635442]  [<ffffffff9a4dc535>] schedule+0x65/0x70
>>> [  363.638638]  [<ffffffff9a4dc948>] schedule_preempt_disabled+0x18/0x30
>>> [  363.642833]  [<ffffffff9a4df0a5>] mutex_lock_nested+0x2e5/0x550
>>> [  363.646599]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>>> [  363.651319]  [<ffffffff9719b721>] ? get_parent_ip+0x11/0x50
>>> [  363.654683]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>>> [  363.658264]  [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>>
>> So it's trying to acquire i_mutex at shmem_fallocate+0x6c...
>>
>>> [  363.662010]  [<ffffffff971bd96e>] ? put_lock_stats.isra.12+0xe/0x30
>>> [  363.665866]  [<ffffffff9730c043>] do_fallocate+0x153/0x1d0
>>> [  363.669381]  [<ffffffff972b472f>] SyS_madvise+0x33f/0x970
>>> [  363.672906]  [<ffffffff9a4e3f13>] tracesys+0xe1/0xe6
>>> [  363.682900] 2 locks held by trinity-c327/9203:
>>> [  363.684928]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff9730c02d>] do_fallocate+0x13d/0x1d0
>>> [  363.715102]  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>>
>> ...but it already holds i_mutex, acquired at shmem_fallocate+0x6c.
>> Am I reading that correctly?
> 
> I wonder why lockdep wouldn't fire here if it were a double lock? I assume lockdep is enabled. It seems to me that lock #1 is being printed because it's being acquired, not because it is already held: __mutex_lock_common() calls mutex_acquire_nest() *before* it actually tries to acquire the mutex. So the output is just confusing.

Nope, it's not a double lock - it's easy to misread lockdep output here.

lockdep marks locks as held even before they are actually acquired:

	static __always_inline int __sched
	__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
	                    struct lockdep_map *nest_lock, unsigned long ip,
	                    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
	{
	        struct task_struct *task = current;
	        struct mutex_waiter waiter;
	        unsigned long flags;
	        int ret;

	        preempt_disable();
	        mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip); <=== Lock marked as acquired

This is done to avoid races where the lock is actually acquired but does not
yet show up in lockdep.

So this trace should be read as: "We acquired sb_writers in do_fallocate() and are
blocked waiting to lock i_mutex in shmem_fallocate".

> So it would again help to see stacks of other tasks, to see who holds the i_mutex and where it's stuck...

The stack printouts got garbled due to the large number of tasks and too small a
console buffer. I've fixed that and will update when (if) the problem reproduces.


Thanks,
Sasha


* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
From: Sasha Levin @ 2014-07-09 16:03 UTC
  To: Vlastimil Babka, Hugh Dickins
  Cc: akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On 07/09/2014 08:47 AM, Sasha Levin wrote:
>> > So it would again help to see stacks of other tasks, to see who holds the i_mutex and where it's stuck...
> The stack printouts got garbled due to the large number of tasks and too small a
> console buffer. I've fixed that and will update when (if) the problem reproduces.

Okay, so the issue reproduces on today's -next as well, and here's my analysis.

Hung task timer was triggered for trinity-c37:

[  483.991095] INFO: task trinity-c37:8968 blocked for more than 120 seconds.
[  483.995898]       Not tainted 3.16.0-rc4-next-20140709-sasha-00024-gd22103d-dirty #775
[  484.000405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.004961] trinity-c37     D 00000000ffffffff 13160  8968   8558 0x10000000
[  484.009035]  ffff8800c0457ce8 0000000000000006 ffffffff9ac1d920 0000000000000001
[  484.012654]  ffff8800c0457fd8 00000000001e2740 00000000001e2740 00000000001e2740
[  484.015716]  ffff880201610000 ffff8800c0bc3000 ffff8800c0457cd8 ffff880223115bb0
[  484.050723] Call Trace:
[  484.051831] schedule (kernel/sched/core.c:2841)
[  484.053683] schedule_preempt_disabled (kernel/sched/core.c:2868)
[  484.055979] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
[  484.058175] ? shmem_fallocate (mm/shmem.c:1760)
[  484.060441] ? get_parent_ip (kernel/sched/core.c:2555)
[  484.063899] ? shmem_fallocate (mm/shmem.c:1760)
[  484.067485] shmem_fallocate (mm/shmem.c:1760)
[  484.071113] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  484.075128] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[  484.078566] SyS_madvise (mm/madvise.c:332 mm/madvise.c:381 mm/madvise.c:531 mm/madvise.c:462)
[  484.082074] tracesys (arch/x86/kernel/entry_64.S:542)
[  484.150284] 2 locks held by trinity-c37/8968:
[  484.152995] #0: (sb_writers#16){.+.+.+}, at: do_fallocate (fs/open.c:298)
[  484.158692] #1: (&sb->s_type->i_mutex_key#18){+.+.+.}, at: shmem_fallocate (mm/shmem.c:1760)

It has acquired the sb_writers lock ("sb_start_write(inode->i_sb);") in do_fallocate()
and is blocked trying to acquire i_mutex in shmem_fallocate():

        if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
                return -EOPNOTSUPP;

        mutex_lock(&inode->i_mutex); <=== HERE

        if (mode & FALLOC_FL_PUNCH_HOLE) {

Now, looking at what actually holds i_mutex rather than waiting on it, we see trinity-c507:

[  488.014804] trinity-c507    D 0000000000000002 13112  9445   8558 0x10000002
[  488.014860]  ffff88010391fab8 0000000000000002 ffffffff9abff6b0 0000000000000002
[  488.014894]  ffff88010391ffd8 00000000001e2740 00000000001e2740 00000000001e2740
[  488.014912]  ffff8800be3f8000 ffff88010389b000 ffff88010391faa8 ffff880223115d58
[  488.014942] Call Trace:
[  488.014948] schedule (kernel/sched/core.c:2841)
[  488.014960] schedule_preempt_disabled (kernel/sched/core.c:2868)
[  488.014970] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
[  488.014983] ? unmap_mapping_range (mm/memory.c:2392)
[  488.014992] ? unmap_mapping_range (mm/memory.c:2392)
[  488.015005] unmap_mapping_range (mm/memory.c:2392)
[  488.015021] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
[  488.015041] shmem_undo_range (mm/shmem.c:441)
[  488.015059] shmem_truncate_range (mm/shmem.c:537)
[  488.015069] shmem_fallocate (mm/shmem.c:1771)
[  488.015079] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  488.015098] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[  488.015110] SyS_madvise (mm/madvise.c:332 mm/madvise.c:381 mm/madvise.c:531 mm/madvise.c:462)
[  488.015131] tracesys (arch/x86/kernel/entry_64.S:542)

It has acquired the i_mutex lock in shmem_fallocate() and is now blocked on i_mmap_mutex
in unmap_mapping_range():

        details.check_mapping = even_cows? NULL: mapping;
        details.nonlinear_vma = NULL;
        details.first_index = hba;
        details.last_index = hba + hlen - 1;
        if (details.last_index < details.first_index)
                details.last_index = ULONG_MAX;


        mutex_lock(&mapping->i_mmap_mutex); <==== HERE
        if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
                unmap_mapping_range_tree(&mapping->i_mmap, &details);

The only process that actually holds the i_mmap_mutex (instead of just spinning on it)
is trinity-c402:

[  487.925991] trinity-c402    R  running task    13160  9339   8558 0x10000000
[  487.926007]  ffff8800b7eb7b88 0000000000000002 ffff88006efe3290 0000000000000282
[  487.926013]  ffff8800b7eb7fd8 00000000001e2740 00000000001e2740 00000000001e2740
[  487.926022]  ffff8800362b3000 ffff8800b7eb8000 ffff8800b7eb7b88 ffff8800b7eb7fd8
[  487.926028] Call Trace:
[  487.926030] preempt_schedule (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:2889)
[  487.926034] ___preempt_schedule (arch/x86/kernel/preempt.S:11)
[  487.926039] ? zap_pte_range (mm/memory.c:1218)
[  487.926042] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  487.926045] ? _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  487.926049] zap_pte_range (mm/memory.c:1218)
[  487.926056] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301 mm/memory.c:1346)
[  487.926060] unmap_vmas (mm/memory.c:1375 (discriminator 1))
[  487.926066] exit_mmap (mm/mmap.c:2802)
[  487.926069] ? preempt_count_sub (kernel/sched/core.c:2611)
[  487.926075] mmput (kernel/fork.c:638)
[  487.926079] do_exit (kernel/exit.c:744)
[  487.926086] do_group_exit (kernel/exit.c:884)
[  487.926091] SyS_exit_group (kernel/exit.c:895)
[  487.926095] tracesys (arch/x86/kernel/entry_64.S:542)

We can see that it's not blocked, since it's in the middle of a spinlock unlock
call, and we can guess that it's been in that function for a while, given the hung
task timer and the other processes waiting on that i_mmap_mutex:


[  487.857145] trinity-c338    D 0000000000000008 12904  9274   8558 0x10000004
[  487.857179]  ffff8800b9bcbaa8 0000000000000002 ffffffff9abff6b0 0000000000000000
[  487.857193]  ffff8800b9bcbfd8 00000000001e2740 00000000001e2740 00000000001e2740
[  487.857202]  ffff880107198000 ffff8800b9bc0000 ffff8800b9bcba98 ffff8801ec17f090
[  487.857209] Call Trace:
[  487.857210] schedule (kernel/sched/core.c:2841)
[  487.857222] schedule_preempt_disabled (kernel/sched/core.c:2868)
[  487.857228] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
[  487.857237] ? unlink_file_vma (mm/mmap.c:245)
[  487.857241] ? unlink_file_vma (mm/mmap.c:245)
[  487.857251] unlink_file_vma (mm/mmap.c:245)
[  487.857261] free_pgtables (mm/memory.c:540)
[  487.857275] exit_mmap (mm/mmap.c:2803)
[  487.857284] ? preempt_count_sub (kernel/sched/core.c:2611)
[  487.857291] mmput (kernel/fork.c:638)
[  487.857305] do_exit (kernel/exit.c:744)
[  487.857317] ? get_signal_to_deliver (kernel/signal.c:2333)
[  487.857327] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  487.857339] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  487.857351] ? _raw_spin_unlock_irq (./arch/x86/include/asm/paravirt.h:819 include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
[  487.857362] do_group_exit (kernel/exit.c:884)
[  487.857376] get_signal_to_deliver (kernel/signal.c:2351)
[  487.857392] do_signal (arch/x86/kernel/signal.c:698)
[  487.857399] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  487.857411] ? vtime_account_user (kernel/sched/cputime.c:687)
[  487.857418] ? preempt_count_sub (kernel/sched/core.c:2611)
[  487.857431] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[  487.857441] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  487.857451] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
[  487.857464] do_notify_resume (arch/x86/kernel/signal.c:751)
[  487.857479] int_signal (arch/x86/kernel/entry_64.S:600)
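
In short, the dependency chain appears to be:

	trinity-c37:  holds sb_writers,   waits on i_mutex
	trinity-c507: holds i_mutex,      waits on i_mmap_mutex
	trinity-c402: holds i_mmap_mutex, still running in zap_pte_range()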

Hope that helps. Full log attached for reference.


Thanks,
Sasha


[-- Attachment #2: log.gz --]
[-- Type: application/gzip, Size: 1419339 bytes --]

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
From: Vlastimil Babka @ 2014-07-09 16:35 UTC
  To: Sasha Levin, Hugh Dickins
  Cc: akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On 07/09/2014 06:03 PM, Sasha Levin wrote:
> On 07/09/2014 08:47 AM, Sasha Levin wrote:
>>>> So it would again help to see stacks of other tasks, to see who holds the i_mutex and where it's stuck...
>> The stack printouts got garbled due to the large number of tasks and too small a
>> console buffer. I've fixed that and will update when (if) the problem reproduces.
>
> Okay, so the issue reproduces on today's -next as well, and here's my analysis.
>
> Hung task timer was triggered for trinity-c37:
>
> [  483.991095] INFO: task trinity-c37:8968 blocked for more than 120 seconds.
> [  483.995898]       Not tainted 3.16.0-rc4-next-20140709-sasha-00024-gd22103d-dirty #775
> [  484.000405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  484.004961] trinity-c37     D 00000000ffffffff 13160  8968   8558 0x10000000
> [  484.009035]  ffff8800c0457ce8 0000000000000006 ffffffff9ac1d920 0000000000000001
> [  484.012654]  ffff8800c0457fd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  484.015716]  ffff880201610000 ffff8800c0bc3000 ffff8800c0457cd8 ffff880223115bb0
> [  484.050723] Call Trace:
> [  484.051831] schedule (kernel/sched/core.c:2841)
> [  484.053683] schedule_preempt_disabled (kernel/sched/core.c:2868)
> [  484.055979] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
> [  484.058175] ? shmem_fallocate (mm/shmem.c:1760)
> [  484.060441] ? get_parent_ip (kernel/sched/core.c:2555)
> [  484.063899] ? shmem_fallocate (mm/shmem.c:1760)
> [  484.067485] shmem_fallocate (mm/shmem.c:1760)
> [  484.071113] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  484.075128] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
> [  484.078566] SyS_madvise (mm/madvise.c:332 mm/madvise.c:381 mm/madvise.c:531 mm/madvise.c:462)
> [  484.082074] tracesys (arch/x86/kernel/entry_64.S:542)
> [  484.150284] 2 locks held by trinity-c37/8968:
> [  484.152995] #0: (sb_writers#16){.+.+.+}, at: do_fallocate (fs/open.c:298)
> [  484.158692] #1: (&sb->s_type->i_mutex_key#18){+.+.+.}, at: shmem_fallocate (mm/shmem.c:1760)
>
> It has acquired the sb_writers lock ("sb_start_write(inode->i_sb);") in do_fallocate()
> and is blocked trying to acquire i_mutex in shmem_fallocate():
>
>          if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>                  return -EOPNOTSUPP;
>
>          mutex_lock(&inode->i_mutex); <=== HERE
>
>          if (mode & FALLOC_FL_PUNCH_HOLE) {
>
> Now, looking at what actually holds i_mutex rather than waiting on it, we see trinity-c507:
>
> [  488.014804] trinity-c507    D 0000000000000002 13112  9445   8558 0x10000002
> [  488.014860]  ffff88010391fab8 0000000000000002 ffffffff9abff6b0 0000000000000002
> [  488.014894]  ffff88010391ffd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  488.014912]  ffff8800be3f8000 ffff88010389b000 ffff88010391faa8 ffff880223115d58
> [  488.014942] Call Trace:
> [  488.014948] schedule (kernel/sched/core.c:2841)
> [  488.014960] schedule_preempt_disabled (kernel/sched/core.c:2868)
> [  488.014970] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
> [  488.014983] ? unmap_mapping_range (mm/memory.c:2392)
> [  488.014992] ? unmap_mapping_range (mm/memory.c:2392)
> [  488.015005] unmap_mapping_range (mm/memory.c:2392)
> [  488.015021] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
> [  488.015041] shmem_undo_range (mm/shmem.c:441)
> [  488.015059] shmem_truncate_range (mm/shmem.c:537)
> [  488.015069] shmem_fallocate (mm/shmem.c:1771)
> [  488.015079] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  488.015098] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
> [  488.015110] SyS_madvise (mm/madvise.c:332 mm/madvise.c:381 mm/madvise.c:531 mm/madvise.c:462)
> [  488.015131] tracesys (arch/x86/kernel/entry_64.S:542)
>
> It has acquired the i_mutex lock in shmem_fallocate() and is now blocked on i_mmap_mutex
> in unmap_mapping_range():
>
>          details.check_mapping = even_cows? NULL: mapping;
>          details.nonlinear_vma = NULL;
>          details.first_index = hba;
>          details.last_index = hba + hlen - 1;
>          if (details.last_index < details.first_index)
>                  details.last_index = ULONG_MAX;
>
>
>          mutex_lock(&mapping->i_mmap_mutex); <==== HERE
>          if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>                  unmap_mapping_range_tree(&mapping->i_mmap, &details);
>
> The only process that actually holds the i_mmap_mutex (instead of just spinning on it)
> is trinity-c402:
>
> [  487.925991] trinity-c402    R  running task    13160  9339   8558 0x10000000
> [  487.926007]  ffff8800b7eb7b88 0000000000000002 ffff88006efe3290 0000000000000282
> [  487.926013]  ffff8800b7eb7fd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  487.926022]  ffff8800362b3000 ffff8800b7eb8000 ffff8800b7eb7b88 ffff8800b7eb7fd8
> [  487.926028] Call Trace:
> [  487.926030] preempt_schedule (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:2889)
> [  487.926034] ___preempt_schedule (arch/x86/kernel/preempt.S:11)
> [  487.926039] ? zap_pte_range (mm/memory.c:1218)
> [  487.926042] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [  487.926045] ? _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [  487.926049] zap_pte_range (mm/memory.c:1218)
> [  487.926056] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301 mm/memory.c:1346)
> [  487.926060] unmap_vmas (mm/memory.c:1375 (discriminator 1))
> [  487.926066] exit_mmap (mm/mmap.c:2802)
> [  487.926069] ? preempt_count_sub (kernel/sched/core.c:2611)
> [  487.926075] mmput (kernel/fork.c:638)
> [  487.926079] do_exit (kernel/exit.c:744)
> [  487.926086] do_group_exit (kernel/exit.c:884)
> [  487.926091] SyS_exit_group (kernel/exit.c:895)
> [  487.926095] tracesys (arch/x86/kernel/entry_64.S:542)
>
> We can see that it's not blocked, since it's in the middle of a spinlock unlock
> call, and we can guess that it's been in that function for a while, given the hung
> task timer and the other processes waiting on that i_mmap_mutex:

Hm, zap_pte_range() potentially has an endless loop due to the 'goto
again' path. Could it be a situation somewhat similar to the fallocate
problem, but where parallel faulters on shared memory are preventing a
process from exiting? Although they don't fault the pages into the same
address space, maybe they could somehow interact through the TLB
flushing code? And perhaps this one only becomes observable now that
the original problem is fixed?
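
For reference, that path looks roughly like this (paraphrased from
mm/memory.c of this era, not quoted exactly):

	again:
		start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
		pte = start_pte;
		do {
			/* zap one pte; when the mmu_gather batch fills up: */
			if (unlikely(!__tlb_remove_page(tlb, page))) {
				force_flush = 1;
				break;
			}
		} while (pte++, addr += PAGE_SIZE, addr != end);
		pte_unmap_unlock(start_pte, ptl);

		if (force_flush) {
			force_flush = 0;
			tlb_flush_mmu(tlb);
			if (addr != end)
				goto again;	/* resume where we stopped */
		}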

>
> [  487.857145] trinity-c338    D 0000000000000008 12904  9274   8558 0x10000004
> [  487.857179]  ffff8800b9bcbaa8 0000000000000002 ffffffff9abff6b0 0000000000000000
> [  487.857193]  ffff8800b9bcbfd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  487.857202]  ffff880107198000 ffff8800b9bc0000 ffff8800b9bcba98 ffff8801ec17f090
> [  487.857209] Call Trace:
> [  487.857210] schedule (kernel/sched/core.c:2841)
> [  487.857222] schedule_preempt_disabled (kernel/sched/core.c:2868)
> [  487.857228] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
> [  487.857237] ? unlink_file_vma (mm/mmap.c:245)
> [  487.857241] ? unlink_file_vma (mm/mmap.c:245)
> [  487.857251] unlink_file_vma (mm/mmap.c:245)
> [  487.857261] free_pgtables (mm/memory.c:540)
> [  487.857275] exit_mmap (mm/mmap.c:2803)
> [  487.857284] ? preempt_count_sub (kernel/sched/core.c:2611)
> [  487.857291] mmput (kernel/fork.c:638)
> [  487.857305] do_exit (kernel/exit.c:744)
> [  487.857317] ? get_signal_to_deliver (kernel/signal.c:2333)
> [  487.857327] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  487.857339] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  487.857351] ? _raw_spin_unlock_irq (./arch/x86/include/asm/paravirt.h:819 include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
> [  487.857362] do_group_exit (kernel/exit.c:884)
> [  487.857376] get_signal_to_deliver (kernel/signal.c:2351)
> [  487.857392] do_signal (arch/x86/kernel/signal.c:698)
> [  487.857399] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  487.857411] ? vtime_account_user (kernel/sched/cputime.c:687)
> [  487.857418] ? preempt_count_sub (kernel/sched/core.c:2611)
> [  487.857431] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
> [  487.857441] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  487.857451] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
> [  487.857464] do_notify_resume (arch/x86/kernel/signal.c:751)
> [  487.857479] int_signal (arch/x86/kernel/entry_64.S:600)
>
> Hope that helps. Full log attached for reference.
>
>
> Thanks,
> Sasha
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-09 16:35             ` Vlastimil Babka
@ 2014-07-09 17:05               ` Hugh Dickins
  0 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-09 17:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sasha Levin, Hugh Dickins, akpm, davej, koct9i, lczerner, stable,
	linux-mm, LKML

On Wed, 9 Jul 2014, Vlastimil Babka wrote:
> On 07/09/2014 06:03 PM, Sasha Levin wrote:
> > 
> > We can see that it's not blocked since it's in the middle of a spinlock
> > unlock
> > call, and we can guess it's been in that function for a while because of
> > the hung
> > task timer, and other processes waiting on that i_mmap_mutex:
> 
> Hm, zap_pte_range potentially has an endless loop due to the 'goto again'
> path. Could it be a somewhat similar situation to the fallocate problem, but
> where parallel faulters on shared memory are preventing a process from
> exiting? Although they don't fault the pages into the same address space,
> they could maybe somehow interact through the TLB flushing code? And is it
> only after fixing the original problem that we can observe this one?

That's a good thought.  It ought to make forward progress nonetheless,
but I believe (please check, I'm rushing) that there's an off-by-one in
that path which could leave us hanging - but only when __tlb_remove_page()
repeatedly fails, which would only happen if exceptionally low on memory??

Does this patch look good, and does it make any difference to the hang?

--- mmotm/mm/memory.c	2014-07-02 15:32:22.212311544 -0700
+++ linux/mm/memory.c	2014-07-09 09:56:33.724159443 -0700
@@ -1145,6 +1145,7 @@ again:
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
+				addr += PAGE_SIZE;
 				force_flush = 1;
 				break;
 			}
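
(To make the retry logic concrete, here is a small self-contained toy
model of that path. zap_range() and batch_add() are invented stand-ins
for zap_pte_range() and __tlb_remove_page(); this is an illustration of
the control flow, not kernel code.)

	#include <stdbool.h>
	#include <stdio.h>

	#define NPAGES	16
	#define BATCH	4

	static bool present[NPAGES];		/* toy "page table" */
	static int gathered[BATCH];
	static int batch_count;

	/*
	 * Stand-in for __tlb_remove_page(): the page is always added,
	 * and false means "batch now full, flush before adding more".
	 */
	static bool batch_add(int page)
	{
		gathered[batch_count++] = page;
		return batch_count < BATCH;
	}

	static void zap_range(int addr, int end)
	{
		bool force_flush;
	again:
		force_flush = false;
		for (; addr != end; addr++) {
			if (!present[addr])
				continue;		/* like pte_none() */
			present[addr] = false;		/* pte already cleared */
			if (!batch_add(addr)) {
				force_flush = true;
				addr++;	/* the fix: step past the page just
					 * batched before retrying below */
				break;
			}
		}
		if (force_flush) {
			batch_count = 0;		/* "flush" the batch */
			if (addr != end)
				goto again;
		}
	}

	int main(void)
	{
		for (int i = 0; i < NPAGES; i++)
			present[i] = true;
		zap_range(0, NPAGES);
		printf("all pages zapped\n");
		return 0;
	}

(Without the addr++, the retry rescans an entry that was already
cleared, so forward progress still depends on that clearing; with it,
the loop itself guarantees progress.)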

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-09 17:05               ` Hugh Dickins
@ 2014-07-10  1:04                 ` Hugh Dickins
  0 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-10  1:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sasha Levin, akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On Wed, 9 Jul 2014, Hugh Dickins wrote:
> On Wed, 9 Jul 2014, Vlastimil Babka wrote:
> > On 07/09/2014 06:03 PM, Sasha Levin wrote:
> > > 
> > > We can see that it's not blocked since it's in the middle of a spinlock
> > > unlock
> > > call, and we can guess it's been in that function for a while because of
> > > the hung
> > > task timer, and other processes waiting on that i_mmap_mutex:
> > 
> > Hm, zap_pte_range potentially has an endless loop due to the 'goto again'
> > path. Could it be a somewhat similar situation to the fallocate problem, but
> > where parallel faulters on shared memory are preventing a process from
> > exiting? Although they don't fault the pages into the same address space,
> > they could maybe somehow interact through the TLB flushing code? And is it
> > only after fixing the original problem that we can observe this one?
> 
> That's a good thought.  It ought to make forward progress nonetheless,
> but I believe (please check, I'm rushing) that there's an off-by-one in
> that path which could leave us hanging - but only when __tlb_remove_page()
> repeatedly fails, which would only happen if exceptionally low on memory??
> 
> Does this patch look good, and does it make any difference to the hang?

I should add that I think that this patch is correct in itself, but
won't actually make any difference to anything.  I'm still looking
through Sasha's log for clues (but shall have to give up soon).

Hugh

> 
> --- mmotm/mm/memory.c	2014-07-02 15:32:22.212311544 -0700
> +++ linux/mm/memory.c	2014-07-09 09:56:33.724159443 -0700
> @@ -1145,6 +1145,7 @@ again:
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
>  			if (unlikely(!__tlb_remove_page(tlb, page))) {
> +				addr += PAGE_SIZE;
>  				force_flush = 1;
>  				break;
>  			}

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-09 16:03         ` Sasha Levin
@ 2014-07-10  7:37             ` Hugh Dickins
  0 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-10  7:37 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Vlastimil Babka, Hugh Dickins, akpm, davej, koct9i, lczerner,
	stable, linux-mm, LKML

On Wed, 9 Jul 2014, Sasha Levin wrote:
> On 07/09/2014 08:47 AM, Sasha Levin wrote:
> >> > So it would again help to see stacks of other tasks, to see who holds the i_mutex and where it's stuck...
> > The stacks print got garbled due to having large amount of tasks and too low of a
> > console buffer. I've fixed that and will update when (if) the problem reproduces.
> 
> Okay, so the issue reproduces on today's -next as well, and here's my analysis.
> 
> Hung task timer was triggered for trinity-c37:
> 
> [  483.991095] INFO: task trinity-c37:8968 blocked for more than 120 seconds.
> [  483.995898]       Not tainted 3.16.0-rc4-next-20140709-sasha-00024-gd22103d-dirty #775
> [  484.000405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  484.004961] trinity-c37     D 00000000ffffffff 13160  8968   8558 0x10000000
> [  484.009035]  ffff8800c0457ce8 0000000000000006 ffffffff9ac1d920 0000000000000001
> [  484.012654]  ffff8800c0457fd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  484.015716]  ffff880201610000 ffff8800c0bc3000 ffff8800c0457cd8 ffff880223115bb0
> [  484.050723] Call Trace:
> [  484.051831] schedule (kernel/sched/core.c:2841)
> [  484.053683] schedule_preempt_disabled (kernel/sched/core.c:2868)
> [  484.055979] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
> [  484.058175] ? shmem_fallocate (mm/shmem.c:1760)
> [  484.060441] ? get_parent_ip (kernel/sched/core.c:2555)
> [  484.063899] ? shmem_fallocate (mm/shmem.c:1760)
> [  484.067485] shmem_fallocate (mm/shmem.c:1760)
> [  484.071113] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  484.075128] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
> [  484.078566] SyS_madvise (mm/madvise.c:332 mm/madvise.c:381 mm/madvise.c:531 mm/madvise.c:462)
> [  484.082074] tracesys (arch/x86/kernel/entry_64.S:542)
> [  484.150284] 2 locks held by trinity-c37/8968:
> [  484.152995] #0: (sb_writers#16){.+.+.+}, at: do_fallocate (fs/open.c:298)
> [  484.158692] #1: (&sb->s_type->i_mutex_key#18){+.+.+.}, at: shmem_fallocate (mm/shmem.c:1760)
> 
> It has acquired sb_writers lock ("sb_start_write(inode->i_sb);") in do_fallocate and
> is blocking on an attempt to acquire i_mutex in shmem_fallocate():
> 
>         if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>                 return -EOPNOTSUPP;
> 
>         mutex_lock(&inode->i_mutex); <=== HERE
> 
>         if (mode & FALLOC_FL_PUNCH_HOLE) {
> 
> Now, looking into what actually holds i_mutex rather than waiting on it we see trinity-c507:
> 
> [  488.014804] trinity-c507    D 0000000000000002 13112  9445   8558 0x10000002
> [  488.014860]  ffff88010391fab8 0000000000000002 ffffffff9abff6b0 0000000000000002
> [  488.014894]  ffff88010391ffd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  488.014912]  ffff8800be3f8000 ffff88010389b000 ffff88010391faa8 ffff880223115d58
> [  488.014942] Call Trace:
> [  488.014948] schedule (kernel/sched/core.c:2841)
> [  488.014960] schedule_preempt_disabled (kernel/sched/core.c:2868)
> [  488.014970] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
> [  488.014983] ? unmap_mapping_range (mm/memory.c:2392)
> [  488.014992] ? unmap_mapping_range (mm/memory.c:2392)
> [  488.015005] unmap_mapping_range (mm/memory.c:2392)
> [  488.015021] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
> [  488.015041] shmem_undo_range (mm/shmem.c:441)
> [  488.015059] shmem_truncate_range (mm/shmem.c:537)
> [  488.015069] shmem_fallocate (mm/shmem.c:1771)
> [  488.015079] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  488.015098] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
> [  488.015110] SyS_madvise (mm/madvise.c:332 mm/madvise.c:381 mm/madvise.c:531 mm/madvise.c:462)
> [  488.015131] tracesys (arch/x86/kernel/entry_64.S:542)
> 
> It has acquired i_mutex lock in shmem_fallocate() and is now blocking on i_mmap_mutex
> in unmap_mapping_range():
> 
>         details.check_mapping = even_cows? NULL: mapping;
>         details.nonlinear_vma = NULL;
>         details.first_index = hba;
>         details.last_index = hba + hlen - 1;
>         if (details.last_index < details.first_index)
>                 details.last_index = ULONG_MAX;
> 
> 
>         mutex_lock(&mapping->i_mmap_mutex); <==== HERE
>         if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>                 unmap_mapping_range_tree(&mapping->i_mmap, &details);

I agree with your analysis up to here.

> 
> The only process that actually holds a i_mmap_mutex (instead of just spinning on it)
> is trinity-c402:

But not here.  trinity-c402 would not be holding i_mmap_mutex when it
does the zap_pte_range for exit_mmap below.  Callers of zap_pte_range
coming from unmap_mapping_range would hold i_mmap_mutex, but that's
not the case for exit_mmap.
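
(For context, the two routes into zap_pte_range() seen in the traces
above differ in exactly this respect; simplified call chains:

	unmap_mapping_range()			takes i_mmap_mutex
	    unmap_mapping_range_tree()
	        zap_page_range_single()
	            unmap_single_vma()
	                zap_pte_range()

	exit_mmap()				i_mmap_mutex not taken
	    unmap_vmas()
	        unmap_single_vma()
	            zap_pte_range()
)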

I realize that the later part of the log shows
    1 lock held by trinity-c402/9339:
    #0: (&mapping->i_mmap_mutex){+.+...}, at: unlink_file_vma (mm/mmap.c:245)
but that's because trinity-c402 completed its unmap_vmas, and had
advanced to free_pgtables by the time that part of the log got written -
given the horrid misfeature that locks being tried are reported as held.

(There are a few other cases where the task has advanced in between
the stack dump and the locks held dump e.g. -c29, -c83, -c87.)

So I think our focus on zap_pte_range was mistaken.

I did not find who was holding the i_mmap_mutex so many are waiting for.
I could easily have missed it, but my suspicion is that nobody holds it,
that it merely got corrupted so that nobody can take it.

I do think that the most useful thing you could do at the moment is
to switch away from running trinity on -next temporarily, and run it
instead on Linus's current git or on 3.16-rc4, but with
f00cdc6df7d7 reverted and my "take 2" inserted in its place.

That tree would also include Heiko's seq_buf_alloc() patch, which
trinity on -next has cast similar doubt upon: at present, we do
not know if Heiko's patch and my patch are bad in themselves,
or exposing other bugs in 3.16-rc, or exposing bugs in -next.

Hugh

> 
> [  487.925991] trinity-c402    R  running task    13160  9339   8558 0x10000000
> [  487.926007]  ffff8800b7eb7b88 0000000000000002 ffff88006efe3290 0000000000000282
> [  487.926013]  ffff8800b7eb7fd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  487.926022]  ffff8800362b3000 ffff8800b7eb8000 ffff8800b7eb7b88 ffff8800b7eb7fd8
> [  487.926028] Call Trace:
> [  487.926030] preempt_schedule (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:2889)
> [  487.926034] ___preempt_schedule (arch/x86/kernel/preempt.S:11)
> [  487.926039] ? zap_pte_range (mm/memory.c:1218)
> [  487.926042] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [  487.926045] ? _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [  487.926049] zap_pte_range (mm/memory.c:1218)
> [  487.926056] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301 mm/memory.c:1346)
> [  487.926060] unmap_vmas (mm/memory.c:1375 (discriminator 1))
> [  487.926066] exit_mmap (mm/mmap.c:2802)
> [  487.926069] ? preempt_count_sub (kernel/sched/core.c:2611)
> [  487.926075] mmput (kernel/fork.c:638)
> [  487.926079] do_exit (kernel/exit.c:744)
> [  487.926086] do_group_exit (kernel/exit.c:884)
> [  487.926091] SyS_exit_group (kernel/exit.c:895)
> [  487.926095] tracesys (arch/x86/kernel/entry_64.S:542)
> 
> We can see that it's not blocked since it's in the middle of a spinlock unlock
> call, and we can guess it's been in that function for a while because of the hung
> task timer, and other processes waiting on that i_mmap_mutex:
> 
> 
> [  487.857145] trinity-c338    D 0000000000000008 12904  9274   8558 0x10000004
> [  487.857179]  ffff8800b9bcbaa8 0000000000000002 ffffffff9abff6b0 0000000000000000
> [  487.857193]  ffff8800b9bcbfd8 00000000001e2740 00000000001e2740 00000000001e2740
> [  487.857202]  ffff880107198000 ffff8800b9bc0000 ffff8800b9bcba98 ffff8801ec17f090
> [  487.857209] Call Trace:
> [  487.857210] schedule (kernel/sched/core.c:2841)
> [  487.857222] schedule_preempt_disabled (kernel/sched/core.c:2868)
> [  487.857228] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
> [  487.857237] ? unlink_file_vma (mm/mmap.c:245)
> [  487.857241] ? unlink_file_vma (mm/mmap.c:245)
> [  487.857251] unlink_file_vma (mm/mmap.c:245)
> [  487.857261] free_pgtables (mm/memory.c:540)
> [  487.857275] exit_mmap (mm/mmap.c:2803)
> [  487.857284] ? preempt_count_sub (kernel/sched/core.c:2611)
> [  487.857291] mmput (kernel/fork.c:638)
> [  487.857305] do_exit (kernel/exit.c:744)
> [  487.857317] ? get_signal_to_deliver (kernel/signal.c:2333)
> [  487.857327] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  487.857339] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  487.857351] ? _raw_spin_unlock_irq (./arch/x86/include/asm/paravirt.h:819 include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
> [  487.857362] do_group_exit (kernel/exit.c:884)
> [  487.857376] get_signal_to_deliver (kernel/signal.c:2351)
> [  487.857392] do_signal (arch/x86/kernel/signal.c:698)
> [  487.857399] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  487.857411] ? vtime_account_user (kernel/sched/cputime.c:687)
> [  487.857418] ? preempt_count_sub (kernel/sched/core.c:2611)
> [  487.857431] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
> [  487.857441] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  487.857451] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
> [  487.857464] do_notify_resume (arch/x86/kernel/signal.c:751)
> [  487.857479] int_signal (arch/x86/kernel/entry_64.S:600)
> 
> Hope that helps. Full log attached for reference.
> 
> 
> Thanks,
> Sasha
> 
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10  7:37             ` Hugh Dickins
@ 2014-07-10 12:46               ` Sasha Levin
  0 siblings, 0 replies; 56+ messages in thread
From: Sasha Levin @ 2014-07-10 12:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On 07/10/2014 03:37 AM, Hugh Dickins wrote:
> I do think that the most useful thing you could do at the moment,
> is to switch away from running trinity on -next temporarily, and
> run it instead on Linus's current git or on 3.16-rc4, but with
> f00cdc6df7d7 reverted and my "take 2" inserted in its place.
> 
> That tree would also include Heiko's seq_buf_alloc() patch, which
> trinity on -next has cast similar doubt upon: at present, we do
> not know if Heiko's patch and my patch are bad in themselves,
> or exposing other bugs in 3.16-rc, or exposing bugs in -next.

Funny enough, Linus's tree doesn't even boot properly here. It's
going to take longer than I expected...


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 12:46               ` Sasha Levin
@ 2014-07-10 17:21                 ` Sasha Levin
  0 siblings, 0 replies; 56+ messages in thread
From: Sasha Levin @ 2014-07-10 17:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On 07/10/2014 08:46 AM, Sasha Levin wrote:
> On 07/10/2014 03:37 AM, Hugh Dickins wrote:
>> > I do think that the most useful thing you could do at the moment,
>> > is to switch away from running trinity on -next temporarily, and
>> > run it instead on Linus's current git or on 3.16-rc4, but with
>> > f00cdc6df7d7 reverted and my "take 2" inserted in its place.
>> > 
>> > That tree would also include Heiko's seq_buf_alloc() patch, which
>> > trinity on -next has cast similar doubt upon: at present, we do
>> > not know if Heiko's patch and my patch are bad in themselves,
>> > or exposing other bugs in 3.16-rc, or exposing bugs in -next.
> Funny enough, Linus's tree doesn't even boot properly here. It's
> going to take longer than I expected...

While I'm failing to reproduce the mountinfo issue on Linus's tree,
the shmem_fallocate one reproduces rather easily.

I've reverted your original fix and applied the "take 2" one as you
suggested; there are no other significant changes on top of Linus's
tree in this case (just Heiko's test patch, some improvements to
what gets printed on hung tasks, and an assortment of unrelated fixes
that are present in -next).

The same structure of locks that was analysed in -next exists here
as well:

Triggered here:

[  364.601210] INFO: task trinity-c214:9083 blocked for more than 120 seconds.
[  364.605498]       Not tainted 3.16.0-rc4-sasha-00069-g615ded7-dirty #793
[  364.609705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  364.614939] trinity-c214    D 0000000000000002 13528  9083   8490 0x00000000
[  364.619414]  ffff880018757ce8 0000000000000002 ffffffff91a01d70 0000000000000001
[  364.624540]  ffff880018757fd8 00000000001d7740 00000000001d7740 00000000001d7740
[  364.629378]  ffff880006428000 ffff880018758000 ffff880018757cd8 ffff880031fdc210
[  364.650601] Call Trace:
[  364.652252] schedule (kernel/sched/core.c:2832)
[  364.655337] schedule_preempt_disabled (kernel/sched/core.c:2859)
[  364.659287] mutex_lock_nested (kernel/locking/mutex.c:535 kernel/locking/mutex.c:587)
[  364.663131] ? shmem_fallocate (mm/shmem.c:1738)
[  364.666616] ? get_parent_ip (kernel/sched/core.c:2546)
[  364.670454] ? shmem_fallocate (mm/shmem.c:1738)
[  364.674159] shmem_fallocate (mm/shmem.c:1738)
[  364.676589] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  364.678415] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  364.680806] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  364.684206] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[  364.687313] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  364.690343] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[  364.692913] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[  364.694450] tracesys (arch/x86/kernel/entry_64.S:543)
[  364.696034] 2 locks held by trinity-c214/9083:
[  364.697222] #0: (sb_writers#9){.+.+.+}, at: do_fallocate (fs/open.c:298)
[  364.700686] #1: (&sb->s_type->i_mutex_key#16){+.+.+.}, at: shmem_fallocate (mm/shmem.c:1738)

Holding i_mutex and blocking on i_mmap_mutex:

[  367.615992] trinity-c100    R  running task    13048  8967   8490 0x00000006
[  367.616039]  ffff88001b903978 0000000000000002 0000000000000006 ffff880404666fd8
[  367.616075]  ffff88001b903fd8 00000000001d7740 00000000001d7740 00000000001d7740
[  367.616113]  ffff880007a40000 ffff88001b8f8000 ffff88001b903968 ffff88001b903fd8
[  367.616152] Call Trace:
[  367.616165] preempt_schedule_irq (./arch/x86/include/asm/paravirt.h:814 kernel/sched/core.c:2912)
[  367.616182] retint_kernel (arch/x86/kernel/entry_64.S:937)
[  367.616198] ? unmap_single_vma (mm/memory.c:1230 mm/memory.c:1277 mm/memory.c:1302 mm/memory.c:1348)
[  367.616213] ? unmap_single_vma (mm/memory.c:1297 mm/memory.c:1348)
[  367.616226] zap_page_range_single (include/linux/mmu_notifier.h:234 mm/memory.c:1429)
[  367.616240] ? get_parent_ip (kernel/sched/core.c:2546)
[  367.616260] ? unmap_mapping_range (mm/memory.c:2391)
[  367.616267] unmap_mapping_range (mm/memory.c:2316 mm/memory.c:2392)
[  367.616271] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
[  367.616281] shmem_undo_range (mm/shmem.c:429)
[  367.616289] shmem_truncate_range (mm/shmem.c:528)
[  367.616296] shmem_fallocate (mm/shmem.c:1749)
[  367.616301] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.616307] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  367.616314] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.616320] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[  367.616326] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.616333] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[  367.616340] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[  367.616345] tracesys (arch/x86/kernel/entry_64.S:543)

And finally, (not) holding the i_mmap_mutex:

[  367.638911] trinity-c190    R  running task    12680  9059   8490 0x00000004
[  367.638928]  ffff8800193db828 0000000000000002 0000000000000030 0000000000000000
[  367.638937]  ffff8800193dbfd8 00000000001d7740 00000000001d7740 00000000001d7740
[  367.638943]  ffff8800048eb000 ffff8800193d0000 ffff8800193db818 ffff8800193dbfd8
[  367.638950] Call Trace:
[  367.638952]  [<ffffffff8e5170f4>] preempt_schedule_irq+0x84/0x100
[  367.638956]  [<ffffffff8e51ec50>] retint_kernel+0x20/0x30
[  367.638960]  [<ffffffff8b290c36>] ? free_hot_cold_page+0x1c6/0x1f0
[  367.638962]  [<ffffffff8b291566>] free_hot_cold_page_list+0x126/0x1a0
[  367.638974]  [<ffffffff8b29702e>] release_pages+0x21e/0x250
[  367.638989]  [<ffffffff8b2cf0c5>] free_pages_and_swap_cache+0x55/0xc0
[  367.638999]  [<ffffffff8b2b68ac>] tlb_flush_mmu_free+0x4c/0x60
[  367.639012]  [<ffffffff8b2b8dd1>] zap_pte_range+0x491/0x4f0
[  367.639019]  [<ffffffff8b2b92ce>] unmap_single_vma+0x49e/0x4c0
[  367.639025]  [<ffffffff8b2b9675>] unmap_vmas+0x65/0x90
[  367.639029]  [<ffffffff8b2c3344>] exit_mmap+0xd4/0x180
[  367.639032]  [<ffffffff8b15c4ab>] mmput+0x5b/0xf0
[  367.639038]  [<ffffffff8b163043>] do_exit+0x3a3/0xc80
[  367.639041]  [<ffffffff8bb46d37>] ? debug_smp_processor_id+0x17/0x20
[  367.639044]  [<ffffffff8b1c2fae>] ? put_lock_stats.isra.12+0xe/0x30
[  367.639047]  [<ffffffff8e51d100>] ? _raw_spin_unlock_irq+0x30/0x70
[  367.639051]  [<ffffffff8b1639f4>] do_group_exit+0x84/0xd0
[  367.639055]  [<ffffffff8b177657>] get_signal_to_deliver+0x807/0x910
[  367.639059]  [<ffffffff8b1a6eb8>] ? vtime_account_user+0x98/0xb0
[  367.639063]  [<ffffffff8b0706c7>] do_signal+0x57/0x9a0
[  367.639066]  [<ffffffff8b1a6eb8>] ? vtime_account_user+0x98/0xb0
[  367.639070]  [<ffffffff8b19f228>] ? preempt_count_sub+0xd8/0x130
[  367.639072]  [<ffffffff8b2848d5>] ? context_tracking_user_exit+0x1b5/0x260
[  367.639078]  [<ffffffff8bb46d13>] ? __this_cpu_preempt_check+0x13/0x20
[  367.639081]  [<ffffffff8b1c5df4>] ? trace_hardirqs_on_caller+0x1f4/0x290
[  367.639087]  [<ffffffff8b07104a>] do_notify_resume+0x3a/0xb0
[  367.639089]  [<ffffffff8e51e02a>] int_signal+0x12/0x17


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
@ 2014-07-10 17:21                 ` Sasha Levin
  0 siblings, 0 replies; 56+ messages in thread
From: Sasha Levin @ 2014-07-10 17:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On 07/10/2014 08:46 AM, Sasha Levin wrote:
> On 07/10/2014 03:37 AM, Hugh Dickins wrote:
>> > I do think that the most useful thing you could do at the moment,
>> > is to switch away from running trinity on -next temporarily, and
>> > run it instead on Linus's current git or on 3.16-rc4, but with
>> > f00cdc6df7d7 reverted and my "take 2" inserted in its place.
>> > 
>> > That tree would also include Heiko's seq_buf_alloc() patch, which
>> > trinity on -next has cast similar doubt upon: at present, we do
>> > not know if Heiko's patch and my patch are bad in themselves,
>> > or exposing other bugs in 3.16-rc, or exposing bugs in -next.
> Funny enough, Linus's tree doesn't even boot properly here. It's
> going to take longer than I expected...

While I'm failing to reproduce the mountinfo issue on Linus's tree,
the shmem_fallocate one reproduces rather easily.

I've reverted your original fix and applied the "take 2" one as you
suggested, there are no other significant changes on top on Linus's
tree in this case (just Heiko's test patch and some improvements to
what gets printed on hung tasks plus an assortment on unrelated fixes
that are present in next).

The same structure of locks that was analysed in -next exists here
as well:

Triggered here:

[  364.601210] INFO: task trinity-c214:9083 blocked for more than 120 seconds.
[  364.605498]       Not tainted 3.16.0-rc4-sasha-00069-g615ded7-dirty #793
[  364.609705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  364.614939] trinity-c214    D 0000000000000002 13528  9083   8490 0x00000000
[  364.619414]  ffff880018757ce8 0000000000000002 ffffffff91a01d70 0000000000000001
[  364.624540]  ffff880018757fd8 00000000001d7740 00000000001d7740 00000000001d7740
[  364.629378]  ffff880006428000 ffff880018758000 ffff880018757cd8 ffff880031fdc210
[  364.650601] Call Trace:
[  364.652252] schedule (kernel/sched/core.c:2832)
[  364.655337] schedule_preempt_disabled (kernel/sched/core.c:2859)
[  364.659287] mutex_lock_nested (kernel/locking/mutex.c:535 kernel/locking/mutex.c:587)
[  364.663131] ? shmem_fallocate (mm/shmem.c:1738)
[  364.666616] ? get_parent_ip (kernel/sched/core.c:2546)
[  364.670454] ? shmem_fallocate (mm/shmem.c:1738)
[  364.674159] shmem_fallocate (mm/shmem.c:1738)
[  364.676589] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  364.678415] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  364.680806] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  364.684206] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[  364.687313] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  364.690343] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[  364.692913] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[  364.694450] tracesys (arch/x86/kernel/entry_64.S:543)
[  364.696034] 2 locks held by trinity-c214/9083:
[  364.697222] #0: (sb_writers#9){.+.+.+}, at: do_fallocate (fs/open.c:298)
[  364.700686] #1: (&sb->s_type->i_mutex_key#16){+.+.+.}, at: shmem_fallocate (mm/shmem.c:1738)

Holding i_mutex and blocking on i_mmap_mutex:

[  367.615992] trinity-c100    R  running task    13048  8967   8490 0x00000006
[  367.616039]  ffff88001b903978 0000000000000002 0000000000000006 ffff880404666fd8
[  367.616075]  ffff88001b903fd8 00000000001d7740 00000000001d7740 00000000001d7740
[  367.616113]  ffff880007a40000 ffff88001b8f8000 ffff88001b903968 ffff88001b903fd8
[  367.616152] Call Trace:
[  367.616165] preempt_schedule_irq (./arch/x86/include/asm/paravirt.h:814 kernel/sched/core.c:2912)
[  367.616182] retint_kernel (arch/x86/kernel/entry_64.S:937)
[  367.616198] ? unmap_single_vma (mm/memory.c:1230 mm/memory.c:1277 mm/memory.c:1302 mm/memory.c:1348)
[  367.616213] ? unmap_single_vma (mm/memory.c:1297 mm/memory.c:1348)
[  367.616226] zap_page_range_single (include/linux/mmu_notifier.h:234 mm/memory.c:1429)
[  367.616240] ? get_parent_ip (kernel/sched/core.c:2546)
[  367.616260] ? unmap_mapping_range (mm/memory.c:2391)
[  367.616267] unmap_mapping_range (mm/memory.c:2316 mm/memory.c:2392)
[  367.616271] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
[  367.616281] shmem_undo_range (mm/shmem.c:429)
[  367.616289] shmem_truncate_range (mm/shmem.c:528)
[  367.616296] shmem_fallocate (mm/shmem.c:1749)
[  367.616301] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.616307] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  367.616314] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.616320] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[  367.616326] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.616333] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[  367.616340] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[  367.616345] tracesys (arch/x86/kernel/entry_64.S:543)

And finally, (not) holding the i_mmap_mutex:

[  367.638911] trinity-c190    R  running task    12680  9059   8490 0x00000004
[  367.638928]  ffff8800193db828 0000000000000002 0000000000000030 0000000000000000
[  367.638937]  ffff8800193dbfd8 00000000001d7740 00000000001d7740 00000000001d7740
[  367.638943]  ffff8800048eb000 ffff8800193d0000 ffff8800193db818 ffff8800193dbfd8
[  367.638950] Call Trace:
[  367.638952]  [<ffffffff8e5170f4>] preempt_schedule_irq+0x84/0x100
[  367.638956]  [<ffffffff8e51ec50>] retint_kernel+0x20/0x30
[  367.638960]  [<ffffffff8b290c36>] ? free_hot_cold_page+0x1c6/0x1f0
[  367.638962]  [<ffffffff8b291566>] free_hot_cold_page_list+0x126/0x1a0
[  367.638974]  [<ffffffff8b29702e>] release_pages+0x21e/0x250
[  367.638989]  [<ffffffff8b2cf0c5>] free_pages_and_swap_cache+0x55/0xc0
[  367.638999]  [<ffffffff8b2b68ac>] tlb_flush_mmu_free+0x4c/0x60
[  367.639012]  [<ffffffff8b2b8dd1>] zap_pte_range+0x491/0x4f0
[  367.639019]  [<ffffffff8b2b92ce>] unmap_single_vma+0x49e/0x4c0
[  367.639025]  [<ffffffff8b2b9675>] unmap_vmas+0x65/0x90
[  367.639029]  [<ffffffff8b2c3344>] exit_mmap+0xd4/0x180
[  367.639032]  [<ffffffff8b15c4ab>] mmput+0x5b/0xf0
[  367.639038]  [<ffffffff8b163043>] do_exit+0x3a3/0xc80
[  367.639041]  [<ffffffff8bb46d37>] ? debug_smp_processor_id+0x17/0x20
[  367.639044]  [<ffffffff8b1c2fae>] ? put_lock_stats.isra.12+0xe/0x30
[  367.639047]  [<ffffffff8e51d100>] ? _raw_spin_unlock_irq+0x30/0x70
[  367.639051]  [<ffffffff8b1639f4>] do_group_exit+0x84/0xd0
[  367.639055]  [<ffffffff8b177657>] get_signal_to_deliver+0x807/0x910
[  367.639059]  [<ffffffff8b1a6eb8>] ? vtime_account_user+0x98/0xb0
[  367.639063]  [<ffffffff8b0706c7>] do_signal+0x57/0x9a0
[  367.639066]  [<ffffffff8b1a6eb8>] ? vtime_account_user+0x98/0xb0
[  367.639070]  [<ffffffff8b19f228>] ? preempt_count_sub+0xd8/0x130
[  367.639072]  [<ffffffff8b2848d5>] ? context_tracking_user_exit+0x1b5/0x260
[  367.639078]  [<ffffffff8bb46d13>] ? __this_cpu_preempt_check+0x13/0x20
[  367.639081]  [<ffffffff8b1c5df4>] ? trace_hardirqs_on_caller+0x1f4/0x290
[  367.639087]  [<ffffffff8b07104a>] do_notify_resume+0x3a/0xb0
[  367.639089]  [<ffffffff8e51e02a>] int_signal+0x12/0x17


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 17:21                 ` Sasha Levin
@ 2014-07-10 17:55                   ` Hugh Dickins
  -1 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-10 17:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Heiko Carstens, Hugh Dickins, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On Thu, 10 Jul 2014, Sasha Levin wrote:
> On 07/10/2014 08:46 AM, Sasha Levin wrote:
> > On 07/10/2014 03:37 AM, Hugh Dickins wrote:
> >> > I do think that the most useful thing you could do at the moment,
> >> > is to switch away from running trinity on -next temporarily, and
> >> > run it instead on Linus's current git or on 3.16-rc4, but with
> >> > f00cdc6df7d7 reverted and my "take 2" inserted in its place.
> >> > 
> >> > That tree would also include Heiko's seq_buf_alloc() patch, which
> >> > trinity on -next has cast similar doubt upon: at present, we do
> >> > not know if Heiko's patch and my patch are bad in themselves,
> >> > or exposing other bugs in 3.16-rc, or exposing bugs in -next.
> > Funny enough, Linus's tree doesn't even boot properly here. It's
> > going to take longer than I expected...

Thanks a lot for getting down to this so quickly.

> 
> While I'm failing to reproduce the mountinfo issue on Linus's tree,

That's good news for Heiko's patch:
no reason to rush and revert it, I guess.

> the shmem_fallocate one reproduces rather easily.

"Good" news of the opposite kind ;)  Very useful to know that.

> 
> I've reverted your original fix and applied the "take 2" one as you
> suggested; there are no other significant changes on top of Linus's
> tree in this case (just Heiko's test patch, some improvements to
> what gets printed on hung tasks, plus an assortment of unrelated fixes
> that are present in next).

Just right, thanks.

> 
> The same structure of locks that was analysed in -next exists here
> as well:
> 
> Triggered here:
> 
> [  364.601210] INFO: task trinity-c214:9083 blocked for more than 120 seconds.
> [  364.605498]       Not tainted 3.16.0-rc4-sasha-00069-g615ded7-dirty #793
> [  364.609705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  364.614939] trinity-c214    D 0000000000000002 13528  9083   8490 0x00000000
> [  364.619414]  ffff880018757ce8 0000000000000002 ffffffff91a01d70 0000000000000001
> [  364.624540]  ffff880018757fd8 00000000001d7740 00000000001d7740 00000000001d7740
> [  364.629378]  ffff880006428000 ffff880018758000 ffff880018757cd8 ffff880031fdc210
> [  364.650601] Call Trace:
> [  364.652252] schedule (kernel/sched/core.c:2832)
> [  364.655337] schedule_preempt_disabled (kernel/sched/core.c:2859)
> [  364.659287] mutex_lock_nested (kernel/locking/mutex.c:535 kernel/locking/mutex.c:587)
> [  364.663131] ? shmem_fallocate (mm/shmem.c:1738)
> [  364.666616] ? get_parent_ip (kernel/sched/core.c:2546)
> [  364.670454] ? shmem_fallocate (mm/shmem.c:1738)
> [  364.674159] shmem_fallocate (mm/shmem.c:1738)
> [  364.676589] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
> [  364.678415] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  364.680806] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
> [  364.684206] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
> [  364.687313] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
> [  364.690343] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
> [  364.692913] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [  364.694450] tracesys (arch/x86/kernel/entry_64.S:543)
> [  364.696034] 2 locks held by trinity-c214/9083:
> [  364.697222] #0: (sb_writers#9){.+.+.+}, at: do_fallocate (fs/open.c:298)
> [  364.700686] #1: (&sb->s_type->i_mutex_key#16){+.+.+.}, at: shmem_fallocate (mm/shmem.c:1738)
> 
> Holding i_mutex and blocking on i_mmap_mutex:
> 
> [  367.615992] trinity-c100    R  running task    13048  8967   8490 0x00000006
> [  367.616039]  ffff88001b903978 0000000000000002 0000000000000006 ffff880404666fd8
> [  367.616075]  ffff88001b903fd8 00000000001d7740 00000000001d7740 00000000001d7740
> [  367.616113]  ffff880007a40000 ffff88001b8f8000 ffff88001b903968 ffff88001b903fd8
> [  367.616152] Call Trace:
> [  367.616165] preempt_schedule_irq (./arch/x86/include/asm/paravirt.h:814 kernel/sched/core.c:2912)
> [  367.616182] retint_kernel (arch/x86/kernel/entry_64.S:937)
> [  367.616198] ? unmap_single_vma (mm/memory.c:1230 mm/memory.c:1277 mm/memory.c:1302 mm/memory.c:1348)
> [  367.616213] ? unmap_single_vma (mm/memory.c:1297 mm/memory.c:1348)
> [  367.616226] zap_page_range_single (include/linux/mmu_notifier.h:234 mm/memory.c:1429)
> [  367.616240] ? get_parent_ip (kernel/sched/core.c:2546)
> [  367.616260] ? unmap_mapping_range (mm/memory.c:2391)
> [  367.616267] unmap_mapping_range (mm/memory.c:2316 mm/memory.c:2392)
> [  367.616271] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
> [  367.616281] shmem_undo_range (mm/shmem.c:429)
> [  367.616289] shmem_truncate_range (mm/shmem.c:528)
> [  367.616296] shmem_fallocate (mm/shmem.c:1749)
> [  367.616301] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
> [  367.616307] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  367.616314] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
> [  367.616320] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
> [  367.616326] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
> [  367.616333] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
> [  367.616340] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [  367.616345] tracesys (arch/x86/kernel/entry_64.S:543)
> 
> And finally, (not) holding the i_mmap_mutex:

I don't understand what prompts you to show this particular task.
I imagine the dump shows lots of other tasks which are waiting to get an
i_mmap_mutex, and quite a lot of other tasks which are neither waiting
for nor holding an i_mmap_mutex.

Why are you showing this one in particular?  Because it looks like the
one you fingered yesterday?  But I didn't see a good reason to finger
that one either.

Mind you, I'm still baffled by the oops on testing PageAnon that you
got a few days ago in zap_pte_range(), so maybe zap_pte_range() does
deserve special attention.

> 
> [  367.638911] trinity-c190    R  running task    12680  9059   8490 0x00000004
> [  367.638928]  ffff8800193db828 0000000000000002 0000000000000030 0000000000000000
> [  367.638937]  ffff8800193dbfd8 00000000001d7740 00000000001d7740 00000000001d7740
> [  367.638943]  ffff8800048eb000 ffff8800193d0000 ffff8800193db818 ffff8800193dbfd8
> [  367.638950] Call Trace:
> [  367.638952]  [<ffffffff8e5170f4>] preempt_schedule_irq+0x84/0x100
> [  367.638956]  [<ffffffff8e51ec50>] retint_kernel+0x20/0x30
> [  367.638960]  [<ffffffff8b290c36>] ? free_hot_cold_page+0x1c6/0x1f0
> [  367.638962]  [<ffffffff8b291566>] free_hot_cold_page_list+0x126/0x1a0
> [  367.638974]  [<ffffffff8b29702e>] release_pages+0x21e/0x250
> [  367.638989]  [<ffffffff8b2cf0c5>] free_pages_and_swap_cache+0x55/0xc0
> [  367.638999]  [<ffffffff8b2b68ac>] tlb_flush_mmu_free+0x4c/0x60
> [  367.639012]  [<ffffffff8b2b8dd1>] zap_pte_range+0x491/0x4f0
> [  367.639019]  [<ffffffff8b2b92ce>] unmap_single_vma+0x49e/0x4c0
> [  367.639025]  [<ffffffff8b2b9675>] unmap_vmas+0x65/0x90
> [  367.639029]  [<ffffffff8b2c3344>] exit_mmap+0xd4/0x180
> [  367.639032]  [<ffffffff8b15c4ab>] mmput+0x5b/0xf0
> [  367.639038]  [<ffffffff8b163043>] do_exit+0x3a3/0xc80
> [  367.639041]  [<ffffffff8bb46d37>] ? debug_smp_processor_id+0x17/0x20
> [  367.639044]  [<ffffffff8b1c2fae>] ? put_lock_stats.isra.12+0xe/0x30
> [  367.639047]  [<ffffffff8e51d100>] ? _raw_spin_unlock_irq+0x30/0x70
> [  367.639051]  [<ffffffff8b1639f4>] do_group_exit+0x84/0xd0
> [  367.639055]  [<ffffffff8b177657>] get_signal_to_deliver+0x807/0x910
> [  367.639059]  [<ffffffff8b1a6eb8>] ? vtime_account_user+0x98/0xb0
> [  367.639063]  [<ffffffff8b0706c7>] do_signal+0x57/0x9a0
> [  367.639066]  [<ffffffff8b1a6eb8>] ? vtime_account_user+0x98/0xb0
> [  367.639070]  [<ffffffff8b19f228>] ? preempt_count_sub+0xd8/0x130
> [  367.639072]  [<ffffffff8b2848d5>] ? context_tracking_user_exit+0x1b5/0x260
> [  367.639078]  [<ffffffff8bb46d13>] ? __this_cpu_preempt_check+0x13/0x20
> [  367.639081]  [<ffffffff8b1c5df4>] ? trace_hardirqs_on_caller+0x1f4/0x290
> [  367.639087]  [<ffffffff8b07104a>] do_notify_resume+0x3a/0xb0
> [  367.639089]  [<ffffffff8e51e02a>] int_signal+0x12/0x17

I'm not sure whether I should ask you to attach the full dump or not:
I learnt very little from yesterday's 11MB, and from what you say this
one will look very similar.  Oh, I'd better ask for it, but I bet you
process these things much faster than I can manage.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 17:55                   ` Hugh Dickins
  (?)
@ 2014-07-10 18:14                   ` Sasha Levin
  2014-07-10 18:52                       ` Hugh Dickins
  -1 siblings, 1 reply; 56+ messages in thread
From: Sasha Levin @ 2014-07-10 18:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Heiko Carstens, Vlastimil Babka, akpm, davej, koct9i, lczerner,
	stable, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

On 07/10/2014 01:55 PM, Hugh Dickins wrote:
>> And finally, (not) holding the i_mmap_mutex:
> I don't understand what prompts you to show this particular task.
> I imagine the dump shows lots of other tasks which are waiting to get an
> i_mmap_mutex, and quite a lot of other tasks which are neither waiting
> for nor holding an i_mmap_mutex.
> 
> Why are you showing this one in particular?  Because it looks like the
> one you fingered yesterday?  But I didn't see a good reason to finger
> that one either.

There are a few more tasks like this one; my criterion was tasks that lockdep
claims are holding i_mmap_mutex but are actually not.

One new thing that I did notice: since trinity spawns a lot of new children
to test out things like execve(), which ends up killing said children, there
tends to be a rather large number of new tasks created and killed constantly.

So if you look at the bottom of the new log (attached), you'll see that there
are quite a few "trinity-subchild" processes trying to die, unsuccessfully.


Thanks,
Sasha



[-- Attachment #2: log.gz --]
[-- Type: application/gzip, Size: 354861 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 18:14                   ` Sasha Levin
@ 2014-07-10 18:52                       ` Hugh Dickins
  0 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-10 18:52 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Heiko Carstens, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On Thu, 10 Jul 2014, Sasha Levin wrote:
> On 07/10/2014 01:55 PM, Hugh Dickins wrote:
> >> And finally, (not) holding the i_mmap_mutex:
> > I don't understand what prompts you to show this particular task.
> > I imagine the dump shows lots of other tasks which are waiting to get an
> > i_mmap_mutex, and quite a lot of other tasks which are neither waiting
> > for nor holding an i_mmap_mutex.
> > 
> > Why are you showing this one in particular?  Because it looks like the
> > one you fingered yesterday?  But I didn't see a good reason to finger
> > that one either.
> 
> There are a few more tasks like this one, my criteria was tasks that lockdep
> claims were holding i_mmap_mutex, but are actually not.

You and Vlastimil enlightened me yesterday that lockdep shows tasks as
holding i_mmap_mutex when they are actually waiting to get i_mmap_mutex.
Hundreds of those in yesterday's log, hundreds of them in today's.
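
As I now understand it, that's simply down to the ordering of the
lockdep annotations in the mutex slowpath: the dep_map is marked
acquired before the task ever sleeps.  A much condensed sketch of
kernel/locking/mutex.c (not the literal source):

void __sched mutex_lock_nested(struct mutex *lock, unsigned int subclass)
{
	might_sleep();
	/*
	 * lockdep marks the mutex as held from this point, *before*
	 * any blocking, so that lock-order cycles involving sleeping
	 * tasks can still be detected: which is why the "locks held by"
	 * output lists waiters as if they were holders.
	 */
	mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_);

	if (!atomic_dec_and_test(&lock->count)) {	/* fastpath failed */
		lock_contended(&lock->dep_map, _RET_IP_);
		__mutex_lock_slowpath(&lock->count);	/* may sleep a long time */
	}
	/* only here does the task actually own the mutex */
	lock_acquired(&lock->dep_map, _RET_IP_);
}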

The full log you've sent (thanks) is for a different run from the one
you showed in today's mail.  No problem with that, except when I assert
that trinity-c190 in today's mail is just like trinity-c402 in yesterday's,
a task caught at one stage of exit_mmap in the stack dumps, then a later
stage of exit_mmap in the locks held dumps, I'm guessing rather than
confirming from the log.

There's nothing(?) interesting about those tasks, they're just tasks we
have been lucky to catch a moment before they reach the i_mmap_mutex
hang affecting the majority.

> 
> One new thing that I did notice is that since trinity spins a lot of new children
> to test out things like execve() which would kill said children, there tends to
> be a rather large amount of new tasks created and killed constantly.
> 
> So if you look at the bottom of the new log (attached), you'll see that there
> are quite a few "trinity-subchild" processes trying to die, unsuccessfully.

Lots of those in yesterday's log too: waiting to get i_mmap_mutex.

I'll pore over the new log.  It does help to know that its base kernel
is more stable: thanks so much.  But whether I can work out any more...

Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 18:52                       ` Hugh Dickins
@ 2014-07-10 19:02                         ` Sasha Levin
  -1 siblings, 0 replies; 56+ messages in thread
From: Sasha Levin @ 2014-07-10 19:02 UTC (permalink / raw)
  To: Hugh Dickins, Peter Zijlstra
  Cc: Heiko Carstens, Vlastimil Babka, akpm, davej, koct9i, lczerner,
	stable, linux-mm, LKML

On 07/10/2014 02:52 PM, Hugh Dickins wrote:
> On Thu, 10 Jul 2014, Sasha Levin wrote:
>> > On 07/10/2014 01:55 PM, Hugh Dickins wrote:
>>>> > >> And finally, (not) holding the i_mmap_mutex:
>>> > > I don't understand what prompts you to show this particular task.
>>> > > I imagine the dump shows lots of other tasks which are waiting to get an
>>> > > i_mmap_mutex, and quite a lot of other tasks which are neither waiting
>>> > > for nor holding an i_mmap_mutex.
>>> > > 
>>> > > Why are you showing this one in particular?  Because it looks like the
>>> > > one you fingered yesterday?  But I didn't see a good reason to finger
>>> > > that one either.
>> > 
>> > There are a few more tasks like this one, my criteria was tasks that lockdep
>> > claims were holding i_mmap_mutex, but are actually not.
> You and Vlastimil enlightened me yesterday that lockdep shows tasks as
> holding i_mmap_mutex when they are actually waiting to get i_mmap_mutex.
> Hundreds of those in yesterday's log, hundreds of them in today's.

What if we move lockdep's acquisition point to after the lock has actually
been taken?

We'd miss deadlocks, but we don't care about them right now. Anyway, doesn't
lockdep have anything built in to let us distinguish between locks we are
attempting to acquire and locks that are actually held?

(cc PeterZ)

We can treat locks that are in the process of being acquired the same as
acquired locks to avoid races, but when we print something out it would
be nice to have an annotation of the real state of the lock.
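
Something like this hacked-up sketch is what I have in mind ("pending"
is not an existing held_lock field, and the hook points are only
approximate -- purely hypothetical, to show the idea):

/* in __lock_acquire(), when the held_lock entry is set up: */
	hlock->pending = 1;		/* only waiting so far */

/* in lock_acquired(), once the lock is really ours: */
	hlock->pending = 0;

/* in lockdep_print_held_locks(), for each entry: */
	if (hlock->pending)
		printk(" (not yet held: still waiting)\n");

That keeps the dep_map taken early, so deadlock detection still sees
waiters, but the dump would say which "held" locks are really only
being waited for.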


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 19:02                         ` Sasha Levin
@ 2014-07-10 19:06                           ` Hugh Dickins
  -1 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-10 19:06 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Peter Zijlstra, Heiko Carstens, Vlastimil Babka,
	akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On Thu, 10 Jul 2014, Sasha Levin wrote:
> On 07/10/2014 02:52 PM, Hugh Dickins wrote:
> > On Thu, 10 Jul 2014, Sasha Levin wrote:
> >> > On 07/10/2014 01:55 PM, Hugh Dickins wrote:
> >>>> > >> And finally, (not) holding the i_mmap_mutex:
> >>> > > I don't understand what prompts you to show this particular task.
> >>> > > I imagine the dump shows lots of other tasks which are waiting to get an
> >>> > > i_mmap_mutex, and quite a lot of other tasks which are neither waiting
> >>> > > for nor holding an i_mmap_mutex.
> >>> > > 
> >>> > > Why are you showing this one in particular?  Because it looks like the
> >>> > > one you fingered yesterday?  But I didn't see a good reason to finger
> >>> > > that one either.
> >> > 
> >> > There are a few more tasks like this one, my criteria was tasks that lockdep
> >> > claims were holding i_mmap_mutex, but are actually not.
> > You and Vlastimil enlightened me yesterday that lockdep shows tasks as
> > holding i_mmap_mutex when they are actually waiting to get i_mmap_mutex.
> > Hundreds of those in yesterday's log, hundreds of them in today's.
> 
> What if we move lockdep's acquisition point to after it actually got the
> lock?
> 
> We'd miss deadlocks, but we don't care about them right now. Anyways, doesn't
> lockdep have anything built in to allow us to separate between locks which
> we attempt to acquire and locks that are actually acquired?
> 
> (cc PeterZ)
> 
> We can treat locks that are in the process of being acquired the same as
> acquired locks to avoid races, but when we print something out it would
> be nice to have annotation of the read state of the lock.

I certainly hope someone can work on improving that.  I imagine it would
be easy, and well worth doing.  But won't be looking into it myself.

Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 19:06                           ` Hugh Dickins
@ 2014-07-10 19:09                             ` Sasha Levin
  -1 siblings, 0 replies; 56+ messages in thread
From: Sasha Levin @ 2014-07-10 19:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Heiko Carstens, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On 07/10/2014 03:06 PM, Hugh Dickins wrote:
> On Thu, 10 Jul 2014, Sasha Levin wrote:
>> > On 07/10/2014 02:52 PM, Hugh Dickins wrote:
>>> > > On Thu, 10 Jul 2014, Sasha Levin wrote:
>>>>> > >> > On 07/10/2014 01:55 PM, Hugh Dickins wrote:
>>>>>>>>> > >>>> > >> And finally, (not) holding the i_mmap_mutex:
>>>>>>> > >>> > > I don't understand what prompts you to show this particular task.
>>>>>>> > >>> > > I imagine the dump shows lots of other tasks which are waiting to get an
>>>>>>> > >>> > > i_mmap_mutex, and quite a lot of other tasks which are neither waiting
>>>>>>> > >>> > > for nor holding an i_mmap_mutex.
>>>>>>> > >>> > > 
>>>>>>> > >>> > > Why are you showing this one in particular?  Because it looks like the
>>>>>>> > >>> > > one you fingered yesterday?  But I didn't see a good reason to finger
>>>>>>> > >>> > > that one either.
>>>>> > >> > 
>>>>> > >> > There are a few more tasks like this one, my criteria was tasks that lockdep
>>>>> > >> > claims were holding i_mmap_mutex, but are actually not.
>>> > > You and Vlastimil enlightened me yesterday that lockdep shows tasks as
>>> > > holding i_mmap_mutex when they are actually waiting to get i_mmap_mutex.
>>> > > Hundreds of those in yesterday's log, hundreds of them in today's.
>> > 
>> > What if we move lockdep's acquisition point to after it actually got the
>> > lock?
>> > 
>> > We'd miss deadlocks, but we don't care about them right now. Anyways, doesn't
>> > lockdep have anything built in to allow us to separate between locks which
>> > we attempt to acquire and locks that are actually acquired?
>> > 
>> > (cc PeterZ)
>> > 
>> > We can treat locks that are in the process of being acquired the same as
>> > acquired locks to avoid races, but when we print something out it would
>> > be nice to have annotation of the read state of the lock.
> I certainly hope someone can work on improving that.  I imagine it would
> be easy, and well worth doing.  But won't be looking into it myself.

I'd be happy to work on that; I just want Peter to confirm that there's no
particular reason this is missing right now.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 19:09                             ` Sasha Levin
@ 2014-07-10 19:56                               ` Hugh Dickins
  -1 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-10 19:56 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Peter Zijlstra, Heiko Carstens, Vlastimil Babka,
	akpm, davej, koct9i, lczerner, stable, linux-mm, LKML

On Thu, 10 Jul 2014, Sasha Levin wrote:
> On 07/10/2014 03:06 PM, Hugh Dickins wrote:
> > On Thu, 10 Jul 2014, Sasha Levin wrote:
> >> > On 07/10/2014 02:52 PM, Hugh Dickins wrote:
> >>> > > On Thu, 10 Jul 2014, Sasha Levin wrote:
> >>>>> > >> > On 07/10/2014 01:55 PM, Hugh Dickins wrote:
> >>>>>>>>> > >>>> > >> And finally, (not) holding the i_mmap_mutex:
> >>>>>>> > >>> > > I don't understand what prompts you to show this particular task.
> >>>>>>> > >>> > > I imagine the dump shows lots of other tasks which are waiting to get an
> >>>>>>> > >>> > > i_mmap_mutex, and quite a lot of other tasks which are neither waiting
> >>>>>>> > >>> > > for nor holding an i_mmap_mutex.
> >>>>>>> > >>> > > 
> >>>>>>> > >>> > > Why are you showing this one in particular?  Because it looks like the
> >>>>>>> > >>> > > one you fingered yesterday?  But I didn't see a good reason to finger
> >>>>>>> > >>> > > that one either.
> >>>>> > >> > 
> >>>>> > >> > There are a few more tasks like this one, my criteria was tasks that lockdep
> >>>>> > >> > claims were holding i_mmap_mutex, but are actually not.
> >>> > > You and Vlastimil enlightened me yesterday that lockdep shows tasks as
> >>> > > holding i_mmap_mutex when they are actually waiting to get i_mmap_mutex.
> >>> > > Hundreds of those in yesterday's log, hundreds of them in today's.
> >> > 
> >> > What if we move lockdep's acquisition point to after it actually got the
> >> > lock?
> >> > 
> >> > We'd miss deadlocks, but we don't care about them right now. Anyways, doesn't
> >> > lockdep have anything built in to allow us to separate between locks which
> >> > we attempt to acquire and locks that are actually acquired?
> >> > 
> >> > (cc PeterZ)
> >> > 
> >> > We can treat locks that are in the process of being acquired the same as
> >> > acquired locks to avoid races, but when we print something out it would
> >> > be nice to have annotation of the read state of the lock.
> > I certainly hope someone can work on improving that.  I imagine it would
> > be easy, and well worth doing.  But won't be looking into it myself.
> 
> I'd be happy to work on that, just want Peter to confirm that there's no reason
> that this is missing right now.

Great, thanks.  And for this bug (and many others?) it would also be very
helpful if tasks waiting on a mutex showed the current mutex owner's pid.

Don't worry about getting a final mergeable patch, covering all lock types:
just something hacked up to show that i_mmap_mutex owner would help a lot.

But be careful, maybe owner is corrupted, or contains a now-invalid
address, or points to something no longer a task_struct.
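
Something along these lines would do (an untested sketch, and
debug_show_mutex_owner() is a made-up name: it assumes a config where
mutex->owner is maintained, CONFIG_MUTEX_SPIN_ON_OWNER or
CONFIG_DEBUG_MUTEXES):

static void debug_show_mutex_owner(struct mutex *lock)
{
	struct task_struct *owner = ACCESS_ONCE(lock->owner);
	char comm[TASK_COMM_LEN];
	pid_t pid;

	if (!owner) {
		printk(KERN_INFO "mutex %p: no recorded owner\n", lock);
		return;
	}
	/*
	 * The owner may have exited and been freed by now, so copy the
	 * fields out with probe_kernel_read() rather than dereferencing
	 * what might no longer be a valid task_struct.
	 */
	if (probe_kernel_read(&pid, &owner->pid, sizeof(pid)) ||
	    probe_kernel_read(comm, owner->comm, sizeof(comm))) {
		printk(KERN_INFO "mutex %p: owner %p unreadable\n",
		       lock, owner);
		return;
	}
	comm[TASK_COMM_LEN - 1] = '\0';
	printk(KERN_INFO "mutex %p: apparently owned by %s/%d\n",
	       lock, comm, pid);
}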

Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 18:52                       ` Hugh Dickins
@ 2014-07-10 20:06                         ` Hugh Dickins
  -1 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-10 20:06 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Heiko Carstens, Vlastimil Babka, akpm, davej, koct9i, lczerner,
	stable, linux-mm, LKML

On Thu, 10 Jul 2014, Hugh Dickins wrote:
> 
> I'll pore over the new log.  It does help to know that its base kernel
> is more stable: thanks so much.  But whether I can work out any more...

Actually, today's log is not much use to me: for a tenth of a second
it just shows "NNN printk messages dropped" instead of task traces.
Do you have a better one?

Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 18:52                       ` Hugh Dickins
@ 2014-07-11  6:59                         ` Hugh Dickins
  -1 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-11  6:59 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Zijlstra, Heiko Carstens, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On Thu, 10 Jul 2014, Hugh Dickins wrote:
> On Thu, 10 Jul 2014, Sasha Levin wrote:
> > On 07/10/2014 01:55 PM, Hugh Dickins wrote:
> > >> And finally, (not) holding the i_mmap_mutex:
> > > I don't understand what prompts you to show this particular task.
> > > I imagine the dump shows lots of other tasks which are waiting to get an
> > > i_mmap_mutex, and quite a lot of other tasks which are neither waiting
> > > for nor holding an i_mmap_mutex.
> > > 
> > > Why are you showing this one in particular?  Because it looks like the
> > > one you fingered yesterday?  But I didn't see a good reason to finger
> > > that one either.
> > 
> > There are a few more tasks like this one, my criteria was tasks that lockdep
> > claims were holding i_mmap_mutex, but are actually not.
> 
> You and Vlastimil enlightened me yesterday that lockdep shows tasks as
> holding i_mmap_mutex when they are actually waiting to get i_mmap_mutex.
> Hundreds of those in yesterday's log, hundreds of them in today's.
> 
> The full log you've sent (thanks) is for a different run from the one
> you showed in today's mail.  No problem with that, except when I assert
> that trinity-c190 in today's mail is just like trinity-c402 in yesterday's,
> a task caught at one stage of exit_mmap in the stack dumps, then a later
> stage of exit_mmap in the locks held dumps, I'm guessing rather than
> confirming from the log.
> 
> There's nothing(?) interesting about those tasks, they're just tasks we
> have been lucky to catch a moment before they reach the i_mmap_mutex
> hang affecting the majority.
> 
> > 
> > One new thing that I did notice is that since trinity spins a lot of new children
> > to test out things like execve() which would kill said children, there tends to
> > be a rather large amount of new tasks created and killed constantly.
> > 
> > So if you look at the bottom of the new log (attached), you'll see that there
> > are quite a few "trinity-subchild" processes trying to die, unsuccessfully.
> 
> Lots of those in yesterday's log too: waiting to get i_mmap_mutex.
> 
> I'll pore over the new log.  It does help to know that its base kernel
> is more stable: thanks so much.  But whether I can work out any more...

In fact Thursday's log was good enough, and no need for the improved
lockdep messaging we talked about, not for this bug at least.

Not that I properly understand it yet, but at least I identified the
task holding the i_mmap_mutex in Wednesday's and in Thursday's log.
It's so very obvious that I'm embarrassed even to pass on the info: I
pretty much said which task it was, without realizing it myself.

In each log there was only one task down below unmap_mapping_range().

Wednesday on linux-next-based 3.16.0-rc4-next-20140709-sasha-00024-gd22103d-dirty #775

trinity-c235    R  running task    12216  9169   8558 0x10000002
 ffff8800bbf978a8 0000000000000002 ffff88010cfe3290 0000000000000282
 ffff8800bbf97fd8 00000000001e2740 00000000001e2740 00000000001e2740
 ffff8800bdb03000 ffff8800bbf2b000 ffff8800bbf978a8 ffff8800bbf97fd8
Call Trace:
preempt_schedule (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:2889)  (that's the __preempt_count_sub line after __schedule, I believe)
___preempt_schedule (arch/x86/kernel/preempt.S:11)
? zap_pte_range (mm/memory.c:1218)
? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
? _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
zap_pte_range (mm/memory.c:1218)
unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301 mm/memory.c:1346)
zap_page_range_single (include/linux/mmu_notifier.h:234 mm/memory.c:1427)
? unmap_mapping_range (mm/memory.c:2392)
? _raw_spin_unlock_irq (./arch/x86/include/asm/paravirt.h:819 include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
unmap_mapping_range (mm/memory.c:2317 mm/memory.c:2393)
truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
shmem_undo_range (mm/shmem.c:441)
shmem_truncate_range (mm/shmem.c:537)
shmem_fallocate (mm/shmem.c:1771)
? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
SyS_madvise (mm/madvise.c:332 mm/madvise.c:381 mm/madvise.c:531 mm/madvise.c:462)
tracesys (arch/x86/kernel/entry_64.S:542)

Thursday on Linus-based 3.16.0-rc4-sasha-00069-g615ded7-dirty #793

trinity-c100    R  running task    13048  8967   8490 0x00000006
 ffff88001b903978 0000000000000002 0000000000000006 ffff880404666fd8
 ffff88001b903fd8 00000000001d7740 00000000001d7740 00000000001d7740
 ffff880007a40000 ffff88001b8f8000 ffff88001b903968 ffff88001b903fd8
Call Trace:
preempt_schedule_irq (./arch/x86/include/asm/paravirt.h:814 kernel/sched/core.c:2912) (that's the local_irq_disable line after __schedule, I believe)
retint_kernel (arch/x86/kernel/entry_64.S:937)
? unmap_single_vma (mm/memory.c:1230 mm/memory.c:1277 mm/memory.c:1302 mm/memory.c:1348)
? unmap_single_vma (mm/memory.c:1297 mm/memory.c:1348)
zap_page_range_single (include/linux/mmu_notifier.h:234 mm/memory.c:1429)
? get_parent_ip (kernel/sched/core.c:2546)
? unmap_mapping_range (mm/memory.c:2391)
unmap_mapping_range (mm/memory.c:2316 mm/memory.c:2392)
truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
shmem_undo_range (mm/shmem.c:429)
shmem_truncate_range (mm/shmem.c:528)
shmem_fallocate (mm/shmem.c:1749)
? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
tracesys (arch/x86/kernel/entry_64.S:543)

At first I thought the preempt_schedule[_irq] lines were exciting,
but seeing them in other traces, I now suppose they're nothing but
an artifact of how your watchdog gets the traces of running tasks.
(I wonder why so many tasks are shown as "running" when they're not,
but I don't think that has any bearing on the issue at hand.)

So, in each case we have a task which _appears_ to be stuck in the
i_mmap tree, holding i_mmap_mutex inside i_mutex.

I wondered if there were some interval tree corruption or bug that
gets it stuck in a cycle.  But it seems unlikely that such a bug
would always manifest here and never under other conditions.

I've looked back at your original December and February reports in
the "mm: shm: hang in shmem_fallocate" thread.  I think "chastening"
is the word for how similar those traces are to what I show above.
I feel I have come full circle and made exactly 0 progress since then.

And I notice your remark from back then: "To me it seems like a series
of calls to shmem_truncate_range() takes so long that one of the tasks
triggers a hung task.  We don't actually hang in any specific
shmem_truncate_range() for too long though."

I think I'm coming around to that view: trinity's forks grow the
i_mmap tree so big that the work to remove a single page from that
tree (while holding both i_mmap_mutex and i_mutex) takes a long time.
As soon as i_mmap_mutex is dropped, a large number of forks and exits
waiting for that i_mmap_mutex come in; and if another page gets mapped
and has to be removed, our hole-punch has to wait a long time behind
them to get the i_mmap_mutex again, all the while holding i_mutex to
make matters even worse for the other waiting hole-punchers.
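
To illustrate just the shape of that starvation, here is a minimal
userspace sketch, with pthread mutexes standing in for i_mutex and
i_mmap_mutex (illustrative only, nothing like the real code; and note
pthread mutexes are not FIFO-fair, so the effect is milder than with
the kernel's queued mutex waiters):

/* build: cc -pthread -o starve starve.c */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t i_mutex = PTHREAD_MUTEX_INITIALIZER;      /* the puncher's outer lock */
static pthread_mutex_t i_mmap_mutex = PTHREAD_MUTEX_INITIALIZER; /* contended per "page" */

static void *puncher(void *arg)
{
        pthread_mutex_lock(&i_mutex);                   /* held across the whole punch */
        for (int page = 0; page < 100; page++) {
                pthread_mutex_lock(&i_mmap_mutex);      /* can queue behind the whole crowd */
                usleep(1000);                           /* "remove one page from a big tree" */
                pthread_mutex_unlock(&i_mmap_mutex);    /* the crowd piles in here */
        }
        pthread_mutex_unlock(&i_mutex);
        return NULL;
}

static void *forker(void *arg)
{
        for (;;) {                                      /* forks and exits hammering the mapping */
                pthread_mutex_lock(&i_mmap_mutex);
                pthread_mutex_unlock(&i_mmap_mutex);
        }
        return NULL;
}

int main(void)
{
        pthread_t p, f[50];

        pthread_create(&p, NULL, puncher, NULL);
        for (int i = 0; i < 50; i++)
                pthread_create(&f[i], NULL, forker, NULL);
        pthread_join(p, NULL);  /* each of the 100 iterations may wait behind 50 forkers */
        return 0;
}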

Or that's the picture I'm currently playing with, anyway.  And what
to do about it?  Dunno yet: maybe something much closer to the patch
that's already in Linus's tree (but buggy because of grabbing i_mutex
itself).  I'll mull over it.  But maybe this picture is rubbish:
I've made enough mistakes already, no surprise if this is wrong too.

And as soon as I write that, I remember something else to check,
and it does seem interesting, leading off in another direction.

Your Wednesday log took 5 advancing snapshots of the same hang,
across 4 seconds.  Above I've shown trinity-c235 from the first
of them.  But the following 4 are identical apart from timestamps.
Identical, when I'd expect it to be caught at different places in
its loop, not every time in zap_pte_range at mm/memory.c:1218
(which is the closing brace of zap_pte_range in my tree: your
line numbering is sometimes helpful, but sometimes puzzling:
I might prefer zap_pte_range+0xXXX/0xYYY in this case).

Or, thinking again on those preempt_schedule()s, maybe it's just an
artifact: that the major work in the loop is done under ptl spinlock,
so if you have to preempt to gather the trace, the most likely place
to be preempted is immediately after dropping the spinlock.
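
Paraphrasing the shape of that loop (not the exact source, just the
locking pattern):

        start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);    /* ptl taken */
        do {
                /* all the real zapping work happens here, with ptl
                 * held, so a preempted trace can never land here */
        } while (pte++, addr += PAGE_SIZE, addr != end);
        pte_unmap_unlock(start_pte, ptl);                               /* ptl dropped */
        /* a pending preempt_schedule() fires right here, so every
         * snapshot of a "running" zapper shows this same spot */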

Yes, that's probably why those five are identical.
I'll go back to mulling over the starvation theory.

Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-10 19:02                         ` Sasha Levin
@ 2014-07-11  8:25                           ` Peter Zijlstra
  -1 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2014-07-11  8:25 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Heiko Carstens, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On Thu, Jul 10, 2014 at 03:02:29PM -0400, Sasha Levin wrote:
> What if we move lockdep's acquisition point to after it actually got the
> lock?

NAK, you want to do deadlock detection _before_ you're stuck in a
deadlock.

> We'd miss deadlocks, but we don't care about them right now. Anyways, doesn't
> lockdep have anything built in to allow us to separate between locks which
> we attempt to acquire and locks that are actually acquired?
> 
> (cc PeterZ)
> 
> We can treat locks that are in the process of being acquired the same as
> acquired locks to avoid races, but when we print something out it would
> be nice to have annotation of the read state of the lock.

I'm missing the problem here I think.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-11  8:25                           ` Peter Zijlstra
@ 2014-07-11  8:33                             ` Vlastimil Babka
  -1 siblings, 0 replies; 56+ messages in thread
From: Vlastimil Babka @ 2014-07-11  8:33 UTC (permalink / raw)
  To: Peter Zijlstra, Sasha Levin
  Cc: Hugh Dickins, Heiko Carstens, akpm, davej, koct9i, lczerner,
	stable, linux-mm, LKML

On 07/11/2014 10:25 AM, Peter Zijlstra wrote:
> On Thu, Jul 10, 2014 at 03:02:29PM -0400, Sasha Levin wrote:
>> What if we move lockdep's acquisition point to after it actually got the
>> lock?
>
> NAK, you want to do deadlock detection _before_ you're stuck in a
> deadlock.
>
>> We'd miss deadlocks, but we don't care about them right now. Anyways, doesn't
>> lockdep have anything built in to allow us to separate between locks which
>> we attempt to acquire and locks that are actually acquired?
>>
>> (cc PeterZ)
>>
>> We can treat locks that are in the process of being acquired the same as
>> acquired locks to avoid races, but when we print something out it would
>> be nice to have annotation of the read state of the lock.
>
> I'm missing the problem here I think.

Quoting Hugh from previous mail in this thread:

>> >
>> > [  363.600969] INFO: task trinity-c327:9203 blocked for more than 120 seconds.
>> > [  363.605359]       Not tainted 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty #772
>> > [  363.609730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [  363.615861] trinity-c327    D 000000000000000b 13496  9203   8559 0x10000004
>> > [  363.620284]  ffff8800b857bce8 0000000000000002 ffffffff9dc11b10 0000000000000001
>> > [  363.624468]  ffff880104860000 ffff8800b857bfd8 00000000001d7740 00000000001d7740
>> > [  363.629118]  ffff880104863000 ffff880104860000 ffff8800b857bcd8 ffff8801eaed8868
>> > [  363.633879] Call Trace:
>> > [  363.635442]  [<ffffffff9a4dc535>] schedule+0x65/0x70
>> > [  363.638638]  [<ffffffff9a4dc948>] schedule_preempt_disabled+0x18/0x30
>> > [  363.642833]  [<ffffffff9a4df0a5>] mutex_lock_nested+0x2e5/0x550
>> > [  363.646599]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>> > [  363.651319]  [<ffffffff9719b721>] ? get_parent_ip+0x11/0x50
>> > [  363.654683]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>> > [  363.658264]  [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>
> So it's trying to acquire i_mutex at shmem_fallocate+0x6c...
>
>> > [  363.662010]  [<ffffffff971bd96e>] ? put_lock_stats.isra.12+0xe/0x30
>> > [  363.665866]  [<ffffffff9730c043>] do_fallocate+0x153/0x1d0
>> > [  363.669381]  [<ffffffff972b472f>] SyS_madvise+0x33f/0x970
>> > [  363.672906]  [<ffffffff9a4e3f13>] tracesys+0xe1/0xe6
>> > [  363.682900] 2 locks held by trinity-c327/9203:
>> > [  363.684928]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff9730c02d>] do_fallocate+0x13d/0x1d0
>> > [  363.715102]  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>
> ...but it already holds i_mutex, acquired at shmem_fallocate+0x6c.
> Am I reading that correctly?

The output looks as if mutex #1 is already taken, but actually the
process is sleeping while trying to take it. It appears that the output
takes the mutex_acquire_nest() annotation into account, but doesn't
distinguish whether lock_acquired() has already happened or not.
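
I.e. the ordering of events in mutex_lock_nested() is roughly this
(paraphrased, not the exact source):

        mutex_acquire_nest(&lock->dep_map, ...);  /* lockdep lists the lock as "held" from here on */
        /* ... slowpath: the task may now sleep a long time,
         *     waiting for the current owner to let go ... */
        lock_acquired(&lock->dep_map, ip);        /* only now is the mutex really ours */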




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-11  8:33                             ` Vlastimil Babka
@ 2014-07-11  8:38                               ` Peter Zijlstra
  -1 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2014-07-11  8:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sasha Levin, Hugh Dickins, Heiko Carstens, akpm, davej, koct9i,
	lczerner, stable, linux-mm, LKML

On Fri, Jul 11, 2014 at 10:33:15AM +0200, Vlastimil Babka wrote:
> Quoting Hugh from previous mail in this thread:
> 
> >>>
> >>> [  363.600969] INFO: task trinity-c327:9203 blocked for more than 120 seconds.
> >>> [  363.605359]       Not tainted 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty #772
> >>> [  363.609730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>> [  363.615861] trinity-c327    D 000000000000000b 13496  9203   8559 0x10000004
> >>> [  363.620284]  ffff8800b857bce8 0000000000000002 ffffffff9dc11b10 0000000000000001
> >>> [  363.624468]  ffff880104860000 ffff8800b857bfd8 00000000001d7740 00000000001d7740
> >>> [  363.629118]  ffff880104863000 ffff880104860000 ffff8800b857bcd8 ffff8801eaed8868
> >>> [  363.633879] Call Trace:
> >>> [  363.635442]  [<ffffffff9a4dc535>] schedule+0x65/0x70
> >>> [  363.638638]  [<ffffffff9a4dc948>] schedule_preempt_disabled+0x18/0x30
> >>> [  363.642833]  [<ffffffff9a4df0a5>] mutex_lock_nested+0x2e5/0x550
> >>> [  363.646599]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
> >>> [  363.651319]  [<ffffffff9719b721>] ? get_parent_ip+0x11/0x50
> >>> [  363.654683]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
> >>> [  363.658264]  [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
> >
> >So it's trying to acquire i_mutex at shmem_fallocate+0x6c...
> >
> >>> [  363.662010]  [<ffffffff971bd96e>] ? put_lock_stats.isra.12+0xe/0x30
> >>> [  363.665866]  [<ffffffff9730c043>] do_fallocate+0x153/0x1d0
> >>> [  363.669381]  [<ffffffff972b472f>] SyS_madvise+0x33f/0x970
> >>> [  363.672906]  [<ffffffff9a4e3f13>] tracesys+0xe1/0xe6
> >>> [  363.682900] 2 locks held by trinity-c327/9203:
> >>> [  363.684928]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff9730c02d>] do_fallocate+0x13d/0x1d0
> >>> [  363.715102]  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
> >
> >...but it already holds i_mutex, acquired at shmem_fallocate+0x6c.
> >Am I reading that correctly?
> 
> The output looks like mutex #1 is already taken, but actually the process is
> sleeping when trying to take it. It appears that the output has taken
> mutex_acquire_nest() action into account, but doesn't distinguish if
> lock_acquired() already happened or not.

The call trace is very clear on it that it's not. I've never found this
to be a problem in practice. You need to engage your brain anyhow; this
little bit extra isn't going to make a difference either way.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-11  8:38                               ` Peter Zijlstra
@ 2014-07-11  8:51                                 ` Vlastimil Babka
  -1 siblings, 0 replies; 56+ messages in thread
From: Vlastimil Babka @ 2014-07-11  8:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Hugh Dickins, Heiko Carstens, akpm, davej, koct9i,
	lczerner, stable, linux-mm, LKML

On 07/11/2014 10:38 AM, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 10:33:15AM +0200, Vlastimil Babka wrote:
>> Quoting Hugh from previous mail in this thread:
>>
>>>>>
>>>>> [  363.600969] INFO: task trinity-c327:9203 blocked for more than 120 seconds.
>>>>> [  363.605359]       Not tainted 3.16.0-rc4-next-20140708-sasha-00022-g94c7290-dirty #772
>>>>> [  363.609730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> [  363.615861] trinity-c327    D 000000000000000b 13496  9203   8559 0x10000004
>>>>> [  363.620284]  ffff8800b857bce8 0000000000000002 ffffffff9dc11b10 0000000000000001
>>>>> [  363.624468]  ffff880104860000 ffff8800b857bfd8 00000000001d7740 00000000001d7740
>>>>> [  363.629118]  ffff880104863000 ffff880104860000 ffff8800b857bcd8 ffff8801eaed8868
>>>>> [  363.633879] Call Trace:
>>>>> [  363.635442]  [<ffffffff9a4dc535>] schedule+0x65/0x70
>>>>> [  363.638638]  [<ffffffff9a4dc948>] schedule_preempt_disabled+0x18/0x30
>>>>> [  363.642833]  [<ffffffff9a4df0a5>] mutex_lock_nested+0x2e5/0x550
>>>>> [  363.646599]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>>>>> [  363.651319]  [<ffffffff9719b721>] ? get_parent_ip+0x11/0x50
>>>>> [  363.654683]  [<ffffffff972a4d7c>] ? shmem_fallocate+0x6c/0x350
>>>>> [  363.658264]  [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>>>
>>> So it's trying to acquire i_mutex at shmem_fallocate+0x6c...
>>>
>>>>> [  363.662010]  [<ffffffff971bd96e>] ? put_lock_stats.isra.12+0xe/0x30
>>>>> [  363.665866]  [<ffffffff9730c043>] do_fallocate+0x153/0x1d0
>>>>> [  363.669381]  [<ffffffff972b472f>] SyS_madvise+0x33f/0x970
>>>>> [  363.672906]  [<ffffffff9a4e3f13>] tracesys+0xe1/0xe6
>>>>> [  363.682900] 2 locks held by trinity-c327/9203:
>>>>> [  363.684928]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff9730c02d>] do_fallocate+0x13d/0x1d0
>>>>> [  363.715102]  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<ffffffff972a4d7c>] shmem_fallocate+0x6c/0x350
>>>
>>> ...but it already holds i_mutex, acquired at shmem_fallocate+0x6c.
>>> Am I reading that correctly?
>>
>> The output looks like mutex #1 is already taken, but actually the process is
>> sleeping when trying to take it. It appears that the output has taken
>> mutex_acquire_nest() action into account, but doesn't distinguish if
>> lock_acquired() already happened or not.
>
> The call trace is very clear on it that its not. I've never found this
> to be a problem in practise. You need to engage your brain anyhow, this
> little bit extra isn't going to make a difference or not.

OK, but what about the case of the "Showing all locks held in the system:"
output, where you don't have the stack traces? Wouldn't it be better if that
distinguished between locks already taken and locks still being taken?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-11  8:25                           ` Peter Zijlstra
@ 2014-07-11 12:22                             ` Sasha Levin
  -1 siblings, 0 replies; 56+ messages in thread
From: Sasha Levin @ 2014-07-11 12:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Heiko Carstens, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On 07/11/2014 04:25 AM, Peter Zijlstra wrote:
> On Thu, Jul 10, 2014 at 03:02:29PM -0400, Sasha Levin wrote:
>> What if we move lockdep's acquisition point to after it actually got the
>> lock?
> 
> NAK, you want to do deadlock detection _before_ you're stuck in a
> deadlock.

I didn't suggest doing it in the general case, just for debugging the issue
we have here.

>> We'd miss deadlocks, but we don't care about them right now. Anyways, doesn't
>> lockdep have anything built in to allow us to separate between locks which
>> we attempt to acquire and locks that are actually acquired?
>>
>> (cc PeterZ)
>>
>> We can treat locks that are in the process of being acquired the same as
>> acquired locks to avoid races, but when we print something out it would
>> be nice to have annotation of the read state of the lock.
> 
> I'm missing the problem here I think.

The problem here is that lockdep reports tasks waiting on a lock as ones that
already have the lock. So we have a list of about 500 different tasks looking
like this:

[  367.805809] 2 locks held by trinity-c214/9083:
[  367.805811] #0: (sb_writers#9){.+.+.+}, at: do_fallocate (fs/open.c:298)
[  367.805824] #1: (&sb->s_type->i_mutex_key#16){+.+.+.}, at: shmem_fallocate (mm/shmem.c:1738)

While they haven't actually acquired i_mutex and are merely blocking on it:

[  367.644150] trinity-c214    D 0000000000000002 13528  9083   8490 0x00000000
[  367.644171]  ffff880018757ce8 0000000000000002 ffffffff91a01d70 0000000000000001
[  367.644178]  ffff880018757fd8 00000000001d7740 00000000001d7740 00000000001d7740
[  367.644188]  ffff880006428000 ffff880018758000 ffff880018757cd8 ffff880031fdc210
[  367.644213] Call Trace:
[  367.644218] schedule (kernel/sched/core.c:2832)
[  367.644229] schedule_preempt_disabled (kernel/sched/core.c:2859)
[  367.644237] mutex_lock_nested (kernel/locking/mutex.c:535 kernel/locking/mutex.c:587)
[  367.644240] ? shmem_fallocate (mm/shmem.c:1738)
[  367.644248] ? get_parent_ip (kernel/sched/core.c:2546)
[  367.644255] ? shmem_fallocate (mm/shmem.c:1738)
[  367.644264] shmem_fallocate (mm/shmem.c:1738)
[  367.644268] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.644280] ? put_lock_stats.isra.12 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  367.644291] ? SyS_madvise (mm/madvise.c:334 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.644298] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[  367.644303] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 mm/madvise.c:465)
[  367.644309] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[  367.644315] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[  367.644321] tracesys (arch/x86/kernel/entry_64.S:543)

There's no easy way to see whether a given task is actually holding a lock or
is just blocking on it without going through all those tasks one by one and
looking at their trace.

I agree with you that "The call trace is very clear on it that it's not", but
when you have 500 call traces you really want something better than going
through it one call trace at a time.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-11 12:22                             ` Sasha Levin
@ 2014-07-11 14:55                               ` Hugh Dickins
  -1 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2014-07-11 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Hugh Dickins, Heiko Carstens, Vlastimil Babka, akpm,
	davej, koct9i, lczerner, stable, linux-mm, LKML

On Fri, 11 Jul 2014, Sasha Levin wrote:
> 
> There's no easy way to see whether a given task is actually holding a lock or
> is just blocking on it without going through all those tasks one by one and
> looking at their trace.
> 
> I agree with you that "The call trace is very clear on it that it's not", but
> when you have 500 call traces you really want something better than going
> through it one call trace at a time.

Points well made, and I strongly agree with Vlastimil and Sasha.
There is a world of difference between a lock wanted and a lock held,
and for the display of locks "held" to conceal that difference is unhelpful.
It just needs one greppable word to distinguish the cases.
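
Something like this, say (purely hypothetical output, "WAITING" being
the one greppable word, added to Sasha's example above):

2 locks held by trinity-c214/9083:
 #0: (sb_writers#9){.+.+.+}, at: do_fallocate (fs/open.c:298)
 #1: (&sb->s_type->i_mutex_key#16){+.+.+.}, WAITING at: shmem_fallocate (mm/shmem.c:1738)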

(Though I didn't find "The call trace is very clear on it that it's not",
I thought it too was telling me that the lock was already held somehow.)

Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-11 14:55                               ` Hugh Dickins
@ 2014-07-11 15:59                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2014-07-11 15:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Sasha Levin, Heiko Carstens, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On Fri, Jul 11, 2014 at 07:55:50AM -0700, Hugh Dickins wrote:
> On Fri, 11 Jul 2014, Sasha Levin wrote:
> > 
> > There's no easy way to see whether a given task is actually holding a lock or
> > is just blocking on it without going through all those tasks one by one and
> > looking at their trace.
> > 
> > I agree with you that "The call trace is very clear on it that it's not", but
> > when you have 500 call traces you really want something better than going
> > through it one call trace at a time.
> 
> Points well made, and I strongly agree with Vlastimil and Sasha.
> There is a world of difference between a lock wanted and a lock held,
> and for the display of locks "held" to conceal that difference is unhelpful.
> It just needs one greppable word to distinguish the cases.

So for the actual locking scenario it doesn't make a difference one way
or another. These threads all can/could/will acquire the lock
(eventually), so all their locking chains should be considered.

I realize that 500+ single lock 'chains' can be tedious; otoh they're
easy to dismiss, since single lock 'chains' are trivial and usually not
interesting in their own right.

> (Though I didn't find "The call trace is very clear on it that it's not",
> I thought it too was telling me that the lock was already held somehow.)

The trace is in the middle of the mutex op; if it were really fully
acquired it would not be, since it would be off doing something else
while holding the mutex.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-11 15:59                                 ` Peter Zijlstra
@ 2014-07-13 21:43                                   ` Sasha Levin
  -1 siblings, 0 replies; 56+ messages in thread
From: Sasha Levin @ 2014-07-13 21:43 UTC (permalink / raw)
  To: Peter Zijlstra, Hugh Dickins
  Cc: Heiko Carstens, Vlastimil Babka, akpm, davej, koct9i, lczerner,
	stable, linux-mm, LKML

On 07/11/2014 11:59 AM, Peter Zijlstra wrote:
>>> I agree with you that "The call trace is very clear on it that it's not", but
>>> > > when you have 500 call traces you really want something better than going
>>> > > through it one call trace at a time.
>> > 
>> > Points well made, and I strongly agree with Vlastimil and Sasha.
>> > There is a world of difference between a lock wanted and a lock held,
>> > and for the display of locks "held" to conceal that difference is unhelpful.
>> > It just needs one greppable word to distinguish the cases.
> So for the actual locking scenario it doesn't make a difference one way
> or another. These threads all can/could/will acquire the lock
> (eventually), so all their locking chains should be considered.

I think that the difference here is that we're not actually debugging a locking
issue; we're merely using lockdep to help with figuring out a non-locking
related bug, and finding it difficult because lockdep's list of "held locks"
is really a lie :)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: + shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch added to -mm tree
  2014-07-13 21:43                                   ` Sasha Levin
@ 2014-07-14 10:10                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2014-07-14 10:10 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Heiko Carstens, Vlastimil Babka, akpm, davej,
	koct9i, lczerner, stable, linux-mm, LKML

On Sun, Jul 13, 2014 at 05:43:13PM -0400, Sasha Levin wrote:
> On 07/11/2014 11:59 AM, Peter Zijlstra wrote:
> >>> I agree with you that "The call trace is very clear on it that it's not", but
> >>> > > when you have 500 call traces you really want something better than going
> >>> > > through it one call trace at a time.
> >> > 
> >> > Points well made, and I strongly agree with Vlastimil and Sasha.
> >> > There is a world of difference between a lock wanted and a lock held,
> >> > and for the display of locks "held" to conceal that difference is unhelpful.
> >> > It just needs one greppable word to distinguish the cases.
> > So for the actual locking scenario it doesn't make a difference one way
> > or another. These threads all can/could/will acquire the lock
> > (eventually), so all their locking chains should be considered.
> 
> I think that the difference here is that we're not actually debugging a locking
> issue, we're merely using lockdep to help with figuring out a non-locking
> related bug and finding it difficult because lockdep's list of "held locks"
> is really a lie :)

OK, so I suppose we could document that the top lock might not actually
be held (yet). But then who will ever read said document ;-)

The problem with 'fixing' this is that we don't exactly have spare
storage in struct held_lock, and lock_acquired() is only enabled for
CONFIG_LOCK_STAT.
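
For concreteness, the kind of thing there's no spare room for would be a
hypothetical extra bit, set only in lock_acquired(), for the held-locks
printout to show:

        struct held_lock {
                ...
                unsigned int acquired:1;        /* hypothetical "really got it" flag */
        };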

I just don't think it's worth it.

^ permalink raw reply	[flat|nested] 56+ messages in thread
