linux-kernel.vger.kernel.org archive mirror
* XFS memory allocation deadlock in 2.6.38
@ 2011-03-21 16:19 Sean Noonan
  2011-03-23 19:39 ` Sean Noonan
  2011-03-27 18:11 ` Maciej Rutecki
  0 siblings, 2 replies; 27+ messages in thread
From: Sean Noonan @ 2011-03-21 16:19 UTC (permalink / raw)
  To: 'linux-kernel@vger.kernel.org'
  Cc: Trammell Hudson, Martin Bligh, Stephen Degler, Christos Zoulas

[-- Attachment #1: Type: text/plain, Size: 2185 bytes --]

This message was originally posted to the XFS mailing list, but received no responses.  Thus, I am sending it to LKML on the advice of Martin.

Using the attached program, we are able to reproduce this bug reliably.
$ make vmtest
$ ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 )) # vmtest <path_to_file> <size_in_bytes>
/xfs/hugefile.dat: mapped 17179869184 bytes in 33822066943 ticks
749660: avg 13339 max 234667 ticks
371945: avg 26885 max 281616 ticks
---
At this point, we see the following on the console:
[593492.694806] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593506.724367] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593524.837717] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593556.742386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

This is the same message presented in
http://oss.sgi.com/bugzilla/show_bug.cgi?id=410

We started testing with 2.6.38-rc7 and have seen this bug through to the .0 release.  This does not appear to be present in 2.6.33, but we have not done testing in between.  We have tested with ext4 and do not encounter this bug.
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_DEBUG is not set
# CONFIG_VXFS_FS is not set

Here is the stack from the process:
[<ffffffff81357553>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff812ddf1e>] xfs_ilock+0x7e/0x110
[<ffffffff8130132f>] __xfs_get_blocks+0x8f/0x4e0
[<ffffffff813017b1>] xfs_get_blocks+0x11/0x20
[<ffffffff8114ba3e>] __block_write_begin+0x1ee/0x5b0
[<ffffffff8114be9d>] block_page_mkwrite+0x9d/0xf0
[<ffffffff81307e05>] xfs_vm_page_mkwrite+0x15/0x20
[<ffffffff810f2ddb>] do_wp_page+0x54b/0x820
[<ffffffff810f347c>] handle_pte_fault+0x3cc/0x820
[<ffffffff810f5145>] handle_mm_fault+0x175/0x2f0
[<ffffffff8102e399>] do_page_fault+0x159/0x470
[<ffffffff816cf6cf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

# uname -a
Linux testhost 2.6.38 #2 SMP PREEMPT Fri Mar 18 15:00:59 GMT 2011 x86_64 GNU/Linux

Please let me know if additional information is required.

Thanks!

Sean

[-- Attachment #2: vmtest.c --]
[-- Type: text/plain, Size: 2185 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <inttypes.h>
#include <errno.h>
#include <fcntl.h>
#include <err.h>


static inline uint64_t
rdtsc(void)
{
    uint32_t low, high;
    __asm__ __volatile__("rdtsc" : "=a"(low), "=d"(high));
    return low | ((uint64_t) high) << 32;
}


void *
mmapfile(
    const char * filename,
    uint64_t len
)
{
    int perms = 0666;
    int open_flags = O_RDWR | O_CREAT;
    int prot = PROT_READ | PROT_WRITE;	// page protections (third mmap argument), not mmap flags

    const int fd = open(filename, open_flags, perms);
    if (fd < 0)
	goto fail;

    // Ensure that the file is empty and the right size
    if (ftruncate(fd, 0) < 0)
	goto fail;

    if (ftruncate(fd, len) < 0)
	goto fail;

    // Map the entire actual length of the file
    void * const base = mmap(
	NULL,
	len,
	prot,
	MAP_SHARED | MAP_POPULATE,
	fd,
	0
    );
    if (base == MAP_FAILED)
	goto fail;

    close(fd);
    return base;

fail:
    err(1, "%s: Unable to map %"PRIu64" bytes", filename, len);
}


int main(
    int argc,
    char ** argv
)
{
    if (argc < 2)
	errx(1, "usage: %s <path_to_file> [size_in_bytes]", argv[0]);
    const char * filename = argv[1];
    const uint64_t len = argc > 2 ? strtoull(argv[2], NULL, 0) : (5ul << 30);
    const uint64_t max_index = len / sizeof(uint64_t);

    uint64_t mmap_time = -rdtsc();
    uint64_t * const buf = mmapfile(filename, len);
    mmap_time += rdtsc();
    fprintf(stderr, "%s: mapped %"PRIu64" bytes in %"PRIu64" ticks\n",
	filename,
	len,
	mmap_time
    );

    while (1)
    {
	uint64_t max = 0;
	uint64_t sum = 0;
	uint64_t i;
	const uint64_t loop_start = rdtsc();
	const uint64_t iters = 1 << 30;

	uint64_t start = loop_start;
	for (i = 0 ; i < iters ; i++)
	{
	    const uint64_t idx = lrand48() % max_index;	// renamed to avoid shadowing the loop counter
	    buf[idx] += start;

	    uint64_t end = rdtsc();
	    const uint64_t delta = end - start;
	    start = end;

	    sum += delta;
	    if (delta > max)
		max = delta;

	    // Force a report every 10 billion ticks ~= 3 seconds
	    if (end - loop_start > 10e9)
		break;
	}

	printf("%"PRIu64": avg %"PRIu64" max %"PRIu64" ticks\n",
	    i,
	    i ? sum / i : 0,
	    max
	);
    }

    return 0;
}


* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-21 16:19 XFS memory allocation deadlock in 2.6.38 Sean Noonan
@ 2011-03-23 19:39 ` Sean Noonan
  2011-03-24 17:43   ` Christoph Hellwig
  2011-03-27 18:11 ` Maciej Rutecki
  1 sibling, 1 reply; 27+ messages in thread
From: Sean Noonan @ 2011-03-23 19:39 UTC (permalink / raw)
  To: Sean Noonan, 'linux-kernel@vger.kernel.org'
  Cc: Trammell Hudson, Martin Bligh, Stephen Degler, Christos Zoulas,
	'linux-xfs@oss.sgi.com'

I believe this patch fixes the behavior:
diff --git a/mm/memory.c b/mm/memory.c
index e48945a..740d5ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3461,7 +3461,9 @@ int make_pages_present(unsigned long addr, unsigned long end)
         * to break COW, except for shared mappings because these don't COW
         * and we would not want to dirty them for nothing.
         */
-       write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
+       write = (vma->vm_flags & VM_WRITE) != 0;
+       if (write && ((vma->vm_flags & VM_SHARED) !=0) && (vma->vm_file == NULL))
+               write = 0;
        BUG_ON(addr >= end);
        BUG_ON(end > vma->vm_end);
        len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;


This was traced to the following commit:
5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 is the first bad commit
commit 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272
Author: Michel Lespinasse <walken@google.com>
Date:   Thu Jan 13 15:46:09 2011 -0800

    mlock: avoid dirtying pages and triggering writeback
    
    When faulting in pages for mlock(), we want to break COW for anonymous or
    file pages within VM_WRITABLE, non-VM_SHARED vmas.  However, there is no
    need to write-fault into VM_SHARED vmas since shared file pages can be
    mlocked first and dirtied later, when/if they actually get written to.
    Skipping the write fault is desirable, as we don't want to unnecessarily
    cause these pages to be dirtied and queued for writeback.
    
    Signed-off-by: Michel Lespinasse <walken@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Theodore Tso <tytso@google.com>
    Cc: Michael Rubin <mrubin@google.com>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 604eede2f45b7e5276ce9725b715ed15a868861d 3c175eadf4cf33d4f78d4d455c9a04f3df2c199e M	mm


-----Original Message-----
From: Sean Noonan 
Sent: Monday, March 21, 2011 12:20
To: 'linux-kernel@vger.kernel.org'
Cc: Trammell Hudson; Martin Bligh; Stephen Degler; Christos Zoulas
Subject: XFS memory allocation deadlock in 2.6.38

This message was originally posted to the XFS mailing list, but received no responses.  Thus, I am sending it to LKML on the advice of Martin.

Using the attached program, we are able to reproduce this bug reliably.
$ make vmtest
$ ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 )) # vmtest <path_to_file> <size_in_bytes>
/xfs/hugefile.dat: mapped 17179869184 bytes in 33822066943 ticks
749660: avg 13339 max 234667 ticks
371945: avg 26885 max 281616 ticks
---
At this point, we see the following on the console:
[593492.694806] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593506.724367] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593524.837717] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593556.742386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

This is the same message presented in
http://oss.sgi.com/bugzilla/show_bug.cgi?id=410

We started testing with 2.6.38-rc7 and have seen this bug through to the .0 release.  This does not appear to be present in 2.6.33, but we have not done testing in between.  We have tested with ext4 and do not encounter this bug.
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_DEBUG is not set
# CONFIG_VXFS_FS is not set

Here is the stack from the process:
[<ffffffff81357553>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff812ddf1e>] xfs_ilock+0x7e/0x110
[<ffffffff8130132f>] __xfs_get_blocks+0x8f/0x4e0
[<ffffffff813017b1>] xfs_get_blocks+0x11/0x20
[<ffffffff8114ba3e>] __block_write_begin+0x1ee/0x5b0
[<ffffffff8114be9d>] block_page_mkwrite+0x9d/0xf0
[<ffffffff81307e05>] xfs_vm_page_mkwrite+0x15/0x20
[<ffffffff810f2ddb>] do_wp_page+0x54b/0x820
[<ffffffff810f347c>] handle_pte_fault+0x3cc/0x820
[<ffffffff810f5145>] handle_mm_fault+0x175/0x2f0
[<ffffffff8102e399>] do_page_fault+0x159/0x470
[<ffffffff816cf6cf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

# uname -a
Linux testhost 2.6.38 #2 SMP PREEMPT Fri Mar 18 15:00:59 GMT 2011 x86_64 GNU/Linux

Please let me know if additional information is required.

Thanks!

Sean


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-23 19:39 ` Sean Noonan
@ 2011-03-24 17:43   ` Christoph Hellwig
  2011-03-24 23:45     ` Michel Lespinasse
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2011-03-24 17:43 UTC (permalink / raw)
  To: Sean Noonan
  Cc: 'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, walken, linux-mm

Michel,

can you take a look at this bug report?  It looks like a regression
in your mlock handling changes.


On Wed, Mar 23, 2011 at 03:39:05PM -0400, Sean Noonan wrote:
> I believe this patch fixes the behavior:
> diff --git a/mm/memory.c b/mm/memory.c
> index e48945a..740d5ab 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3461,7 +3461,9 @@ int make_pages_present(unsigned long addr, unsigned long end)
>          * to break COW, except for shared mappings because these don't COW
>          * and we would not want to dirty them for nothing.
>          */
> -       write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
> +       write = (vma->vm_flags & VM_WRITE) != 0;
> +       if (write && ((vma->vm_flags & VM_SHARED) !=0) && (vma->vm_file == NULL))
> +               write = 0;
>         BUG_ON(addr >= end);
>         BUG_ON(end > vma->vm_end);
>         len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
> 
> 
> This was traced to the following commit:
> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 is the first bad commit
> commit 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272
> Author: Michel Lespinasse <walken@google.com>
> Date:   Thu Jan 13 15:46:09 2011 -0800
> 
>     mlock: avoid dirtying pages and triggering writeback
>     
>     When faulting in pages for mlock(), we want to break COW for anonymous or
>     file pages within VM_WRITABLE, non-VM_SHARED vmas.  However, there is no
>     need to write-fault into VM_SHARED vmas since shared file pages can be
>     mlocked first and dirtied later, when/if they actually get written to.
>     Skipping the write fault is desirable, as we don't want to unnecessarily
>     cause these pages to be dirtied and queued for writeback.
>     
>     Signed-off-by: Michel Lespinasse <walken@google.com>
>     Cc: Hugh Dickins <hughd@google.com>
>     Cc: Rik van Riel <riel@redhat.com>
>     Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Nick Piggin <npiggin@kernel.dk>
>     Cc: Theodore Tso <tytso@google.com>
>     Cc: Michael Rubin <mrubin@google.com>
>     Cc: Suleiman Souhlal <suleiman@google.com>
>     Cc: Dave Chinner <david@fromorbit.com>
>     Cc: Christoph Hellwig <hch@infradead.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> :040000 040000 604eede2f45b7e5276ce9725b715ed15a868861d 3c175eadf4cf33d4f78d4d455c9a04f3df2c199e M	mm
> 
> 
> -----Original Message-----
> From: Sean Noonan 
> Sent: Monday, March 21, 2011 12:20
> To: 'linux-kernel@vger.kernel.org'
> Cc: Trammell Hudson; Martin Bligh; Stephen Degler; Christos Zoulas
> Subject: XFS memory allocation deadlock in 2.6.38
> 
> This message was originally posted to the XFS mailing list, but received no responses.  Thus, I am sending it to LKML on the advice of Martin.
> 
> Using the attached program, we are able to reproduce this bug reliably.
> $ make vmtest
> $ ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 )) # vmtest <path_to_file> <size_in_bytes>
> /xfs/hugefile.dat: mapped 17179869184 bytes in 33822066943 ticks
> 749660: avg 13339 max 234667 ticks
> 371945: avg 26885 max 281616 ticks
> ---
> At this point, we see the following on the console:
> [593492.694806] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> [593506.724367] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> [593524.837717] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> [593556.742386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> 
> This is the same message presented in
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=410
> 
> We started testing with 2.6.38-rc7 and have seen this bug through to the .0 release.  This does not appear to be present in 2.6.33, but we have not done testing in between.  We have tested with ext4 and do not encounter this bug.
> CONFIG_XFS_FS=y
> CONFIG_XFS_QUOTA=y
> CONFIG_XFS_POSIX_ACL=y
> CONFIG_XFS_RT=y
> # CONFIG_XFS_DEBUG is not set
> # CONFIG_VXFS_FS is not set
> 
> Here is the stack from the process:
> [<ffffffff81357553>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff812ddf1e>] xfs_ilock+0x7e/0x110
> [<ffffffff8130132f>] __xfs_get_blocks+0x8f/0x4e0
> [<ffffffff813017b1>] xfs_get_blocks+0x11/0x20
> [<ffffffff8114ba3e>] __block_write_begin+0x1ee/0x5b0
> [<ffffffff8114be9d>] block_page_mkwrite+0x9d/0xf0
> [<ffffffff81307e05>] xfs_vm_page_mkwrite+0x15/0x20
> [<ffffffff810f2ddb>] do_wp_page+0x54b/0x820
> [<ffffffff810f347c>] handle_pte_fault+0x3cc/0x820
> [<ffffffff810f5145>] handle_mm_fault+0x175/0x2f0
> [<ffffffff8102e399>] do_page_fault+0x159/0x470
> [<ffffffff816cf6cf>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> # uname -a
> Linux testhost 2.6.38 #2 SMP PREEMPT Fri Mar 18 15:00:59 GMT 2011 x86_64 GNU/Linux
> 
> Please let me know if additional information is required.
> 
> Thanks!
> 
> Sean
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
---end quoted text---


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-24 17:43   ` Christoph Hellwig
@ 2011-03-24 23:45     ` Michel Lespinasse
  2011-03-28 14:58       ` Sean Noonan
  0 siblings, 1 reply; 27+ messages in thread
From: Michel Lespinasse @ 2011-03-24 23:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sean Noonan, linux-kernel, Martin Bligh, Trammell Hudson,
	Christos Zoulas, linux-xfs, Stephen Degler, linux-mm

On Thu, Mar 24, 2011 at 10:43 AM, Christoph Hellwig <hch@infradead.org> wrote:
> Michel,
>
> can you take a look at this bug report?  It looks like a regression
> in your mlock handling changes.

I had a quick look and at this point I can describe how the patch will
affect behavior of this test, but not why this causes a deadlock with
xfs.

The test creates a writable, shared mapping of a file that does not
have data blocks allocated on disk, and also uses the MAP_POPULATE
flag.

Before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272, make_pages_present
during the mmap would cause data blocks to get allocated on disk with
an xfs_vm_page_mkwrite call, and then the file pages would get mapped
as writable ptes.

After 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272, make_pages_present
does NOT cause data blocks to get allocated on disk. Instead,
xfs_vm_readpages is called, which (I suppose) does not allocate the
data blocks and returns zero filled pages instead, which get mapped as
readonly ptes. Later, the test tries writing into the mmap'ed block,
causing minor page faults, xfs_vm_page_mkwrite calls and data block
allocations to occur.


Regarding the deadlock: I am curious to see if it could be made to
happen before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272. Could you test
what happens if you remove the MAP_POPULATE flag from your mmap call,
and instead read all pages from userspace right after the mmap ? I
expect you would then be able to trigger the deadlock before
5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272.


This leaves the issue of the change of behavior for MAP_POPULATE on
ftruncated file holes. I'm not sure what to say there though, because
MAP_POPULATE is documented to cause file read-ahead (and it still does
after 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272), but that doesn't say
anything about block allocation.


Hope this helps,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-21 16:19 XFS memory allocation deadlock in 2.6.38 Sean Noonan
  2011-03-23 19:39 ` Sean Noonan
@ 2011-03-27 18:11 ` Maciej Rutecki
  1 sibling, 0 replies; 27+ messages in thread
From: Maciej Rutecki @ 2011-03-27 18:11 UTC (permalink / raw)
  To: Sean Noonan
  Cc: 'linux-kernel@vger.kernel.org',
	Trammell Hudson, Martin Bligh, Stephen Degler, Christos Zoulas

I created a Bugzilla entry at 
https://bugzilla.kernel.org/show_bug.cgi?id=31982
for your bug report, please add your address to the CC list in there, thanks!

On poniedziałek, 21 marca 2011 o 17:19:44 Sean Noonan wrote:
> This message was originally posted to the XFS mailing list, but received no
> responses.  Thus, I am sending it to LKML on the advice of Martin.
> 
> Using the attached program, we are able to reproduce this bug reliably.
> $ make vmtest
> $ ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 )) # vmtest <path_to_file> <size_in_bytes>
> /xfs/hugefile.dat: mapped 17179869184 bytes in 33822066943 ticks
> 749660: avg 13339 max 234667 ticks
> 371945: avg 26885 max 281616 ticks
> ---
> At this point, we see the following on the console:
> [593492.694806] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> [593506.724367] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> [593524.837717] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> [593556.742386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> 
> This is the same message presented in
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=410
> 
> We started testing with 2.6.38-rc7 and have seen this bug through to the .0
> release.  This does not appear to be present in 2.6.33, but we have not
> done testing in between.  We have tested with ext4 and do not encounter
> this bug.
> CONFIG_XFS_FS=y
> CONFIG_XFS_QUOTA=y
> CONFIG_XFS_POSIX_ACL=y
> CONFIG_XFS_RT=y
> # CONFIG_XFS_DEBUG is not set
> # CONFIG_VXFS_FS is not set
> 
> Here is the stack from the process:
> [<ffffffff81357553>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff812ddf1e>] xfs_ilock+0x7e/0x110
> [<ffffffff8130132f>] __xfs_get_blocks+0x8f/0x4e0
> [<ffffffff813017b1>] xfs_get_blocks+0x11/0x20
> [<ffffffff8114ba3e>] __block_write_begin+0x1ee/0x5b0
> [<ffffffff8114be9d>] block_page_mkwrite+0x9d/0xf0
> [<ffffffff81307e05>] xfs_vm_page_mkwrite+0x15/0x20
> [<ffffffff810f2ddb>] do_wp_page+0x54b/0x820
> [<ffffffff810f347c>] handle_pte_fault+0x3cc/0x820
> [<ffffffff810f5145>] handle_mm_fault+0x175/0x2f0
> [<ffffffff8102e399>] do_page_fault+0x159/0x470
> [<ffffffff816cf6cf>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> # uname -a
> Linux testhost 2.6.38 #2 SMP PREEMPT Fri Mar 18 15:00:59 GMT 2011 x86_64 GNU/Linux
> 
> Please let me know if additional information is required.
> 
> Thanks!
> 
> Sean

-- 
Maciej Rutecki
http://www.maciek.unixy.pl


* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-24 23:45     ` Michel Lespinasse
@ 2011-03-28 14:58       ` Sean Noonan
  2011-03-28 21:06         ` Michel Lespinasse
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Noonan @ 2011-03-28 14:58 UTC (permalink / raw)
  To: 'Michel Lespinasse', Christoph Hellwig
  Cc: linux-kernel, Martin Bligh, Trammell Hudson, Christos Zoulas,
	linux-xfs, Stephen Degler, linux-mm

> Regarding the deadlock: I am curious to see if it could be made to
> happen before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272. Could you test
> what happens if you remove the MAP_POPULATE flag from your mmap call,
> and instead read all pages from userspace right after the mmap ? I
> expect you would then be able to trigger the deadlock before
> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272.

I still see the deadlock without MAP_POPULATE.

Sean


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-28 14:58       ` Sean Noonan
@ 2011-03-28 21:06         ` Michel Lespinasse
  2011-03-28 21:34           ` Sean Noonan
  0 siblings, 1 reply; 27+ messages in thread
From: Michel Lespinasse @ 2011-03-28 21:06 UTC (permalink / raw)
  To: Sean Noonan
  Cc: Christoph Hellwig, linux-kernel, Martin Bligh, Trammell Hudson,
	Christos Zoulas, linux-xfs, Stephen Degler, linux-mm

On Mon, Mar 28, 2011 at 7:58 AM, Sean Noonan <Sean.Noonan@twosigma.com> wrote:
>> Regarding the deadlock: I am curious to see if it could be made to
>> happen before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272. Could you test
>> what happens if you remove the MAP_POPULATE flag from your mmap call,
>> and instead read all pages from userspace right after the mmap ? I
>> expect you would then be able to trigger the deadlock before
>> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272.
>
> I still see the deadlock without MAP_POPULATE

Could you test if you see the deadlock before
5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-28 21:06         ` Michel Lespinasse
@ 2011-03-28 21:34           ` Sean Noonan
  2011-03-29  0:25             ` Michel Lespinasse
                               ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Sean Noonan @ 2011-03-28 21:34 UTC (permalink / raw)
  To: 'Michel Lespinasse'
  Cc: Christoph Hellwig, linux-kernel, Martin Bligh, Trammell Hudson,
	Christos Zoulas, linux-xfs, Stephen Degler, linux-mm

> Could you test if you see the deadlock before
> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?

Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
Confirmed that the original bug does not present in this version.
Confirmed that removing MAP_POPULATE does cause the deadlock to occur.

Here is the stack of the test:
# cat /proc/3846/stack
[<ffffffff812e8a64>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81271c1d>] xfs_ilock+0x9d/0x110
[<ffffffff81271cae>] xfs_ilock_map_shared+0x1e/0x50
[<ffffffff81294985>] __xfs_get_blocks+0xc5/0x4e0
[<ffffffff81294dcc>] xfs_get_blocks+0xc/0x10
[<ffffffff811322c2>] do_mpage_readpage+0x462/0x660
[<ffffffff8113250a>] mpage_readpage+0x4a/0x60
[<ffffffff81295433>] xfs_vm_readpage+0x13/0x20
[<ffffffff810bb850>] filemap_fault+0x2d0/0x4e0
[<ffffffff810d8680>] __do_fault+0x50/0x510
[<ffffffff810da542>] handle_mm_fault+0x1a2/0xe60
[<ffffffff8102a466>] do_page_fault+0x146/0x440
[<ffffffff8164e6cf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

xfssyncd is stuck in D state.
# cat /proc/2484/stack
[<ffffffff8106ee1c>] down+0x3c/0x50
[<ffffffff81297802>] xfs_buf_lock+0x72/0x170
[<ffffffff8128762d>] xfs_getsb+0x1d/0x50
[<ffffffff8128e6af>] xfs_trans_getsb+0x5f/0x150
[<ffffffff8128821e>] xfs_mod_sb+0x4e/0xe0
[<ffffffff8126e4ea>] xfs_fs_log_dummy+0x5a/0xb0
[<ffffffff812a2a13>] xfs_sync_worker+0x83/0x90
[<ffffffff812a28e2>] xfssyncd+0x172/0x220
[<ffffffff81069576>] kthread+0x96/0xa0
[<ffffffff81003354>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

Sean


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-28 21:34           ` Sean Noonan
@ 2011-03-29  0:25             ` Michel Lespinasse
  2011-03-29  1:51             ` Dave Chinner
  2011-03-29 19:05             ` Sean Noonan
  2 siblings, 0 replies; 27+ messages in thread
From: Michel Lespinasse @ 2011-03-29  0:25 UTC (permalink / raw)
  To: Sean Noonan
  Cc: Christoph Hellwig, linux-kernel, Martin Bligh, Trammell Hudson,
	Christos Zoulas, linux-xfs, Stephen Degler, linux-mm,
	Dave Chinner

On Mon, Mar 28, 2011 at 2:34 PM, Sean Noonan <Sean.Noonan@twosigma.com> wrote:
>> Could you test if you see the deadlock before
>> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
>
> Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
> Confirmed that the original bug does not present in this version.
> Confirmed that removing MAP_POPULATE does cause the deadlock to occur.

It seems that the test (without MAP_POPULATE) reveals that the root
cause is an xfs bug, which had been hidden up to now by MAP_POPULATE
preallocating disk blocks (but could always be triggered by the same
test without the MAP_POPULATE flag). I'm not sure how to go about
debugging the xfs deadlock; it would probably be best if an xfs person
could have a look ?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-28 21:34           ` Sean Noonan
  2011-03-29  0:25             ` Michel Lespinasse
@ 2011-03-29  1:51             ` Dave Chinner
  2011-03-29  2:49               ` Sean Noonan
  2011-03-29 19:05             ` Sean Noonan
  2 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2011-03-29  1:51 UTC (permalink / raw)
  To: Sean Noonan
  Cc: 'Michel Lespinasse',
	Christoph Hellwig, linux-kernel, Martin Bligh, Trammell Hudson,
	Christos Zoulas, linux-xfs, Stephen Degler, linux-mm

On Mon, Mar 28, 2011 at 05:34:09PM -0400, Sean Noonan wrote:
> > Could you test if you see the deadlock before
> > 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
> 
> Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
> Confirmed that the original bug does not present in this version.
> Confirmed that removing MAP_POPULATE does cause the deadlock to occur.
> 
> Here is the stack of the test:
> # cat /proc/3846/stack
> [<ffffffff812e8a64>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff81271c1d>] xfs_ilock+0x9d/0x110
> [<ffffffff81271cae>] xfs_ilock_map_shared+0x1e/0x50
> [<ffffffff81294985>] __xfs_get_blocks+0xc5/0x4e0
> [<ffffffff81294dcc>] xfs_get_blocks+0xc/0x10
> [<ffffffff811322c2>] do_mpage_readpage+0x462/0x660
> [<ffffffff8113250a>] mpage_readpage+0x4a/0x60
> [<ffffffff81295433>] xfs_vm_readpage+0x13/0x20
> [<ffffffff810bb850>] filemap_fault+0x2d0/0x4e0
> [<ffffffff810d8680>] __do_fault+0x50/0x510
> [<ffffffff810da542>] handle_mm_fault+0x1a2/0xe60
> [<ffffffff8102a466>] do_page_fault+0x146/0x440
> [<ffffffff8164e6cf>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

Something else is holding the inode locked here.

> xfssyncd is stuck in D state.
> # cat /proc/2484/stack
> [<ffffffff8106ee1c>] down+0x3c/0x50
> [<ffffffff81297802>] xfs_buf_lock+0x72/0x170
> [<ffffffff8128762d>] xfs_getsb+0x1d/0x50
> [<ffffffff8128e6af>] xfs_trans_getsb+0x5f/0x150
> [<ffffffff8128821e>] xfs_mod_sb+0x4e/0xe0
> [<ffffffff8126e4ea>] xfs_fs_log_dummy+0x5a/0xb0
> [<ffffffff812a2a13>] xfs_sync_worker+0x83/0x90
> [<ffffffff812a28e2>] xfssyncd+0x172/0x220
> [<ffffffff81069576>] kthread+0x96/0xa0
> [<ffffffff81003354>] kernel_thread_helper+0x4/0x10
> [<ffffffffffffffff>] 0xffffffffffffffff

And this is indicating that something else is holding the superblock
locked here. IOWs, whatever thread is having trouble with memory
allocation is causing these threads to block and so they can be
ignored. What's the stack trace of the thread that is throwing the
"I can't allocating a page" errors?

As it is, the question I'd really like answered is how a machine with
48GB RAM can possibly be short of memory when running mmap() on a
16GB file.  The error that XFS is throwing indicates that the
machine cannot allocate a single page of memory, so where has all
your memory gone, and why hasn't the OOM killer been let off the
leash?  What is consuming the other 32GB of RAM or preventing it
from being allocated? 

Also, I was unable to reproduce this at all on a machine with only
2GB of RAM, regardless of the kernel version and/or MAP_POPULATE, so
I'm left to wonder what is special about your test system...

Perhaps the output of xfs_bmap -vvp <file> after a successful vs
deadlocked run would be instructive....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-29  1:51             ` Dave Chinner
@ 2011-03-29  2:49               ` Sean Noonan
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Noonan @ 2011-03-29  2:49 UTC (permalink / raw)
  To: 'Dave Chinner'
  Cc: 'Michel Lespinasse',
	Christoph Hellwig, linux-kernel, Martin Bligh, Trammell Hudson,
	Christos Zoulas, linux-xfs, Stephen Degler, linux-mm

> As it is, the question I'd really like answered is how a machine with
> 48GB RAM can possibly be short of memory when running mmap() on a
> 16GB file.  The error that XFS is throwing indicates that the
> machine cannot allocate a single page of memory, so where has all
> your memory gone, and why hasn't the OOM killer been let off the
> leash?  What is consuming the other 32GB of RAM or preventing it
> from being allocated? 
Here's meminfo while a test was deadlocking.  As you can see, we certainly aren't running out of RAM.
# cat /proc/meminfo 
MemTotal:       49551548 kB
MemFree:        44139876 kB
Buffers:            5324 kB
Cached:          4970552 kB
SwapCached:            0 kB
Active:            52772 kB
Inactive:        4960624 kB
Active(anon):      37864 kB
Inactive(anon):        0 kB
Active(file):      14908 kB
Inactive(file):  4960624 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:           4914084 kB
Writeback:             0 kB
AnonPages:         37636 kB
Mapped:          4925460 kB
Shmem:               280 kB
Slab:             223212 kB
SReclaimable:     176280 kB
SUnreclaim:        46932 kB
KernelStack:        3968 kB
PageTables:        35228 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    47073968 kB
Committed_AS:      86556 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      380892 kB
VmallocChunk:   34331773836 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        2048 kB
DirectMap2M:     2086912 kB
DirectMap1G:    48234496 kB
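As a quick sanity check on the snapshot above, the headline numbers can be pulled out and compared (an illustrative sketch; the values are copied from the paste, and the kB fields are treated as KiB as /proc/meminfo does):

```python
# Headline numbers from the /proc/meminfo snapshot above, in kB (KiB).
meminfo_kb = {
    "MemTotal": 49551548,
    "MemFree": 44139876,
    "Cached": 4970552,
    "Dirty": 4914084,
    "Mapped": 4925460,
}
free_gib = meminfo_kb["MemFree"] / 2**20    # KiB -> GiB
dirty_gib = meminfo_kb["Dirty"] / 2**20
# ~42 GiB free versus ~4.7 GiB of dirty page cache from the 16 GiB mapping:
# the box is nowhere near out of pages when XFS starts complaining.
print(round(free_gib, 1), round(dirty_gib, 1))
```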


> Perhaps the output of xfs_bmap -vvp <file> after a successful vs
> deadlocked run would be instructive....

I will try to get this tomorrow.

Sean

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-28 21:34           ` Sean Noonan
  2011-03-29  0:25             ` Michel Lespinasse
  2011-03-29  1:51             ` Dave Chinner
@ 2011-03-29 19:05             ` Sean Noonan
  2011-03-29 19:24               ` 'Christoph Hellwig'
  2 siblings, 1 reply; 27+ messages in thread
From: Sean Noonan @ 2011-03-29 19:05 UTC (permalink / raw)
  To: Sean Noonan, 'Michel Lespinasse'
  Cc: 'Christoph Hellwig',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

>> Could you test if you see the deadlock before
>> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?

> Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
> Confirmed that the original bug does not present in this version.
> Confirmed that removing MAP_POPULATE does cause the deadlock to occur.

git bisect leads to this:

bdfb04301fa5fdd95f219539a9a5b9663b1e5fc2 is the first bad commit
commit bdfb04301fa5fdd95f219539a9a5b9663b1e5fc2
Author: Christoph Hellwig <hch@infradead.org>
Date:   Wed Jan 20 21:55:30 2010 +0000

    xfs: replace KM_LARGE with explicit vmalloc use
    
    We use the KM_LARGE flag to make kmem_alloc and friends use vmalloc
    if necessary.  As we only need this for a few boot/mount time
    allocations just switch to explicit vmalloc calls there.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>

:040000 040000 1eed68ced17d8794fa842396c01c3b9677c6e709 d462932a318f8c823fa2a73156e980a688968cb2 M	fs
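As an aside on the method: git bisect converges on a single bad commit in logarithmically many build-and-test cycles. A rough sketch of the arithmetic (the commit count for the v2.6.33..v2.6.38 range is a guess for illustration, not a figure from the thread):

```python
import math

def bisect_steps(n_commits: int) -> int:
    """Upper bound on the number of build-and-test cycles git bisect needs
    to narrow a range of n_commits down to one first-bad commit."""
    return math.ceil(math.log2(n_commits)) if n_commits > 1 else 0

# Several kernel releases' worth of history; ~40k commits is a rough guess.
print(bisect_steps(40_000))  # about 16 kernel builds to reach one bad commit
```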

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-29 19:05             ` Sean Noonan
@ 2011-03-29 19:24               ` 'Christoph Hellwig'
  2011-03-29 19:39                 ` Johannes Weiner
                                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: 'Christoph Hellwig' @ 2011-03-29 19:24 UTC (permalink / raw)
  To: Sean Noonan
  Cc: 'Michel Lespinasse', 'Christoph Hellwig',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

Can you check if the brute force patch below helps?  If it does I
still need to refine it a bit, but it could be that we are doing
an allocation under an xfs lock that could recurse back into the
filesystem.  We have a per-process flag to disable that for normal
kmalloc allocation, but we lost it for vmalloc in the commit you
bisected the regression to.


Index: xfs/fs/xfs/linux-2.6/kmem.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/kmem.h	2011-03-29 21:16:58.039224236 +0200
+++ xfs/fs/xfs/linux-2.6/kmem.h	2011-03-29 21:17:08.368223598 +0200
@@ -63,7 +63,7 @@ static inline void *kmem_zalloc_large(si
 {
 	void *ptr;
 
-	ptr = vmalloc(size);
+	ptr = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
 	if (ptr)
 		memset(ptr, 0, size);
 	return ptr;

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-29 19:24               ` 'Christoph Hellwig'
@ 2011-03-29 19:39                 ` Johannes Weiner
  2011-03-29 19:43                   ` 'Christoph Hellwig'
  2011-03-29 19:46                 ` Sean Noonan
  2011-03-29 19:54                 ` Sean Noonan
  2 siblings, 1 reply; 27+ messages in thread
From: Johannes Weiner @ 2011-03-29 19:39 UTC (permalink / raw)
  To: 'Christoph Hellwig'
  Cc: Sean Noonan, 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

On Tue, Mar 29, 2011 at 03:24:34PM -0400, 'Christoph Hellwig' wrote:
> Can you check if the brute force patch below helps?  If it does I
> still need to refine it a bit, but it could be that we are doing
> an allocation under an xfs lock that could recurse back into the
> filesystem.  We have a per-process flag to disable that for normal
> kmalloc allocation, but we lost it for vmalloc in the commit you
> bisected the regression to.
> 
> 
> Index: xfs/fs/xfs/linux-2.6/kmem.h
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/kmem.h	2011-03-29 21:16:58.039224236 +0200
> +++ xfs/fs/xfs/linux-2.6/kmem.h	2011-03-29 21:17:08.368223598 +0200
> @@ -63,7 +63,7 @@ static inline void *kmem_zalloc_large(si
>  {
>  	void *ptr;
>  
> -	ptr = vmalloc(size);
> +	ptr = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
>  	if (ptr)
>  		memset(ptr, 0, size);
>  	return ptr;

Note that vmalloc is currently broken in that it does a GFP_KERNEL
allocation if it has to allocate page table pages, even when invoked
with GFP_NOFS:

	http://marc.info/?l=linux-mm&m=128942194520631&w=4


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-29 19:39                 ` Johannes Weiner
@ 2011-03-29 19:43                   ` 'Christoph Hellwig'
  0 siblings, 0 replies; 27+ messages in thread
From: 'Christoph Hellwig' @ 2011-03-29 19:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: 'Christoph Hellwig',
	Sean Noonan, 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

On Tue, Mar 29, 2011 at 09:39:07PM +0200, Johannes Weiner wrote:
> > -	ptr = vmalloc(size);
> > +	ptr = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
> >  	if (ptr)
> >  		memset(ptr, 0, size);
> >  	return ptr;
> 
> Note that vmalloc is currently broken in that it does a GFP_KERNEL
> allocation if it has to allocate page table pages, even when invoked
> with GFP_NOFS:
> 
> 	http://marc.info/?l=linux-mm&m=128942194520631&w=4

Oh great.  In that case we had a chance to hit the deadlock even before
the offending commit, just a much smaller one.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-29 19:24               ` 'Christoph Hellwig'
  2011-03-29 19:39                 ` Johannes Weiner
@ 2011-03-29 19:46                 ` Sean Noonan
  2011-03-29 20:02                   ` 'Christoph Hellwig'
  2011-03-29 19:54                 ` Sean Noonan
  2 siblings, 1 reply; 27+ messages in thread
From: Sean Noonan @ 2011-03-29 19:46 UTC (permalink / raw)
  To: 'Christoph Hellwig'
  Cc: 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

> Can you check if the brute force patch below helps?

No such luck.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-29 19:24               ` 'Christoph Hellwig'
  2011-03-29 19:39                 ` Johannes Weiner
  2011-03-29 19:46                 ` Sean Noonan
@ 2011-03-29 19:54                 ` Sean Noonan
  2011-03-30  0:09                   ` Dave Chinner
  2 siblings, 1 reply; 27+ messages in thread
From: Sean Noonan @ 2011-03-29 19:54 UTC (permalink / raw)
  To: 'Christoph Hellwig'
  Cc: 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

> Can you check if the brute force patch below helps?  

Not sure if this helps at all, but here is the stack from all three processes involved.  This is without MAP_POPULATE and with the patch you just sent.

# ps aux | grep 'D[+]*[[:space:]]'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2314  0.2  0.0      0     0 ?        D    19:44   0:00 [flush-8:0]
root      2402  0.0  0.0      0     0 ?        D    19:44   0:00 [xfssyncd/sda9]
root      3861  2.6  9.9 16785280 4912848 pts/0 D+  19:45   0:07 ./vmtest /xfs/hugefile.dat 17179869184

# for p in 2314 2402 3861; do echo $p; cat /proc/$p/stack; done
2314
[<ffffffff810d634a>] congestion_wait+0x7a/0x130
[<ffffffff8129721c>] kmem_alloc+0x6c/0xf0
[<ffffffff8127c07e>] xfs_inode_item_format+0x36e/0x3b0
[<ffffffff8128401f>] xfs_log_commit_cil+0x4f/0x3b0
[<ffffffff8128ff31>] _xfs_trans_commit+0x1f1/0x2b0
[<ffffffff8127c716>] xfs_iomap_write_allocate+0x1a6/0x340
[<ffffffff81298883>] xfs_map_blocks+0x193/0x2c0
[<ffffffff812992fa>] xfs_vm_writepage+0x1ca/0x520
[<ffffffff810c4bd2>] __writepage+0x12/0x40
[<ffffffff810c53dd>] write_cache_pages+0x1dd/0x4f0
[<ffffffff810c573c>] generic_writepages+0x4c/0x70
[<ffffffff812986b8>] xfs_vm_writepages+0x58/0x70
[<ffffffff810c577c>] do_writepages+0x1c/0x40
[<ffffffff811247d1>] writeback_single_inode+0xf1/0x240
[<ffffffff81124edd>] writeback_sb_inodes+0xdd/0x1b0
[<ffffffff81125966>] writeback_inodes_wb+0x76/0x160
[<ffffffff81125d93>] wb_writeback+0x343/0x550
[<ffffffff81126126>] wb_do_writeback+0x186/0x2e0
[<ffffffff81126342>] bdi_writeback_thread+0xc2/0x310
[<ffffffff81067846>] kthread+0x96/0xa0
[<ffffffff8165a414>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
2402
[<ffffffff8106d0ec>] down+0x3c/0x50
[<ffffffff8129a7bd>] xfs_buf_lock+0x5d/0x170
[<ffffffff8128a87d>] xfs_getsb+0x1d/0x50
[<ffffffff81291bcf>] xfs_trans_getsb+0x5f/0x150
[<ffffffff8128b80e>] xfs_mod_sb+0x4e/0xe0
[<ffffffff81271dbf>] xfs_fs_log_dummy+0x4f/0x90
[<ffffffff812a61c1>] xfs_sync_worker+0x81/0x90
[<ffffffff812a6092>] xfssyncd+0x172/0x220
[<ffffffff81067846>] kthread+0x96/0xa0
[<ffffffff8165a414>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
3861
[<ffffffff812ec744>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff812754dd>] xfs_ilock+0x9d/0x110
[<ffffffff8127556e>] xfs_ilock_map_shared+0x1e/0x50
[<ffffffff81297c45>] __xfs_get_blocks+0xc5/0x4e0
[<ffffffff8129808c>] xfs_get_blocks+0xc/0x10
[<ffffffff81135ca2>] do_mpage_readpage+0x462/0x660
[<ffffffff81135eea>] mpage_readpage+0x4a/0x60
[<ffffffff812986e3>] xfs_vm_readpage+0x13/0x20
[<ffffffff810bd150>] filemap_fault+0x2d0/0x4e0
[<ffffffff810db0a0>] __do_fault+0x50/0x4f0
[<ffffffff810db85e>] handle_pte_fault+0x7e/0xc90
[<ffffffff810ddbf8>] handle_mm_fault+0x138/0x230
[<ffffffff8102b37c>] do_page_fault+0x12c/0x420
[<ffffffff81658fcf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-29 19:46                 ` Sean Noonan
@ 2011-03-29 20:02                   ` 'Christoph Hellwig'
  2011-03-29 20:23                     ` Sean Noonan
  2011-03-29 22:42                     ` Dave Chinner
  0 siblings, 2 replies; 27+ messages in thread
From: 'Christoph Hellwig' @ 2011-03-29 20:02 UTC (permalink / raw)
  To: Sean Noonan
  Cc: 'Christoph Hellwig', 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

On Tue, Mar 29, 2011 at 03:46:21PM -0400, Sean Noonan wrote:
> > Can you check if the brute force patch below helps?
> 
> No such luck.

Actually thinking about it - we never do the vmalloc under any fs lock,
so this can't be the reason.  But nothing else in the patch springs to
mind either, so to narrow this down does reverting the patch on
2.6.38 also fix it?  The revert isn't quite trivial due to changes
since then, so here's the patch I came up with:


Index: xfs/fs/xfs/linux-2.6/kmem.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/kmem.c	2011-03-29 21:55:12.871726512 +0200
+++ xfs/fs/xfs/linux-2.6/kmem.c	2011-03-29 21:55:31.648723706 +0200
@@ -16,6 +16,7 @@
  * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
  */
 #include <linux/mm.h>
+#include <linux/vmalloc.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
@@ -25,25 +26,8 @@
 #include "kmem.h"
 #include "xfs_message.h"
 
-/*
- * Greedy allocation.  May fail and may return vmalloced memory.
- *
- * Must be freed using kmem_free_large.
- */
-void *
-kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
-{
-	void		*ptr;
-	size_t		kmsize = maxsize;
-
-	while (!(ptr = kmem_zalloc_large(kmsize))) {
-		if ((kmsize >>= 1) <= minsize)
-			kmsize = minsize;
-	}
-	if (ptr)
-		*size = kmsize;
-	return ptr;
-}
+#define MAX_VMALLOCS	6
+#define MAX_SLAB_SIZE	0x20000
 
 void *
 kmem_alloc(size_t size, unsigned int __nocast flags)
@@ -52,8 +36,19 @@ kmem_alloc(size_t size, unsigned int __n
 	gfp_t	lflags = kmem_flags_convert(flags);
 	void	*ptr;
 
+#ifdef DEBUG
+	if (unlikely(!(flags & KM_LARGE) && (size > PAGE_SIZE))) {
+		printk(KERN_WARNING "Large %s attempt, size=%ld\n",
+			__func__, (long)size);
+		dump_stack();
+	}
+#endif
+
 	do {
-		ptr = kmalloc(size, lflags);
+		if (size < MAX_SLAB_SIZE || retries > MAX_VMALLOCS)
+			ptr = kmalloc(size, lflags);
+		else
+			ptr = __vmalloc(size, lflags, PAGE_KERNEL);
 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
 			return ptr;
 		if (!(++retries % 100))
@@ -75,6 +70,27 @@ kmem_zalloc(size_t size, unsigned int __
 	return ptr;
 }
 
+void *
+kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize,
+		   unsigned int __nocast flags)
+{
+	void		*ptr;
+	size_t		kmsize = maxsize;
+	unsigned int	kmflags = (flags & ~KM_SLEEP) | KM_NOSLEEP;
+
+	while (!(ptr = kmem_zalloc(kmsize, kmflags))) {
+		if ((kmsize <= minsize) && (flags & KM_NOSLEEP))
+			break;
+		if ((kmsize >>= 1) <= minsize) {
+			kmsize = minsize;
+			kmflags = flags;
+		}
+	}
+	if (ptr)
+		*size = kmsize;
+	return ptr;
+}
+
 void
 kmem_free(const void *ptr)
 {
Index: xfs/fs/xfs/linux-2.6/kmem.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/kmem.h	2011-03-29 21:55:12.879725146 +0200
+++ xfs/fs/xfs/linux-2.6/kmem.h	2011-03-29 21:55:31.652725467 +0200
@@ -21,7 +21,6 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
-#include <linux/vmalloc.h>
 
 /*
  * General memory allocation interfaces
@@ -31,6 +30,7 @@
 #define KM_NOSLEEP	0x0002u
 #define KM_NOFS		0x0004u
 #define KM_MAYFAIL	0x0008u
+#define KM_LARGE	0x0010u
 
 /*
  * We use a special process flag to avoid recursive callbacks into
@@ -42,7 +42,7 @@ kmem_flags_convert(unsigned int __nocast
 {
 	gfp_t	lflags;
 
-	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL));
+	BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_LARGE));
 
 	if (flags & KM_NOSLEEP) {
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
@@ -56,25 +56,10 @@ kmem_flags_convert(unsigned int __nocast
 
 extern void *kmem_alloc(size_t, unsigned int __nocast);
 extern void *kmem_zalloc(size_t, unsigned int __nocast);
+extern void *kmem_zalloc_greedy(size_t *, size_t, size_t, unsigned int __nocast);
 extern void *kmem_realloc(const void *, size_t, size_t, unsigned int __nocast);
 extern void  kmem_free(const void *);
 
-static inline void *kmem_zalloc_large(size_t size)
-{
-	void *ptr;
-
-	ptr = vmalloc(size);
-	if (ptr)
-		memset(ptr, 0, size);
-	return ptr;
-}
-static inline void kmem_free_large(void *ptr)
-{
-	vfree(ptr);
-}
-
-extern void *kmem_zalloc_greedy(size_t *, size_t, size_t);
-
 /*
  * Zone interfaces
  */
Index: xfs/fs/xfs/quota/xfs_qm.c
===================================================================
--- xfs.orig/fs/xfs/quota/xfs_qm.c	2011-03-29 21:55:12.859726589 +0200
+++ xfs/fs/xfs/quota/xfs_qm.c	2011-03-29 21:55:41.387278609 +0200
@@ -110,11 +110,12 @@ xfs_Gqm_init(void)
 	 */
 	udqhash = kmem_zalloc_greedy(&hsize,
 				     XFS_QM_HASHSIZE_LOW * sizeof(xfs_dqhash_t),
-				     XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t));
+				     XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t),
+				     KM_SLEEP | KM_MAYFAIL | KM_LARGE);
 	if (!udqhash)
 		goto out;
 
-	gdqhash = kmem_zalloc_large(hsize);
+	gdqhash = kmem_zalloc(hsize, KM_SLEEP | KM_LARGE);
 	if (!gdqhash)
 		goto out_free_udqhash;
 
@@ -171,7 +172,7 @@ xfs_Gqm_init(void)
 	return xqm;
 
  out_free_udqhash:
-	kmem_free_large(udqhash);
+	kmem_free(udqhash);
  out:
 	return NULL;
 }
@@ -194,8 +195,8 @@ xfs_qm_destroy(
 		xfs_qm_list_destroy(&(xqm->qm_usr_dqhtable[i]));
 		xfs_qm_list_destroy(&(xqm->qm_grp_dqhtable[i]));
 	}
-	kmem_free_large(xqm->qm_usr_dqhtable);
-	kmem_free_large(xqm->qm_grp_dqhtable);
+	kmem_free(xqm->qm_usr_dqhtable);
+	kmem_free(xqm->qm_grp_dqhtable);
 	xqm->qm_usr_dqhtable = NULL;
 	xqm->qm_grp_dqhtable = NULL;
 	xqm->qm_dqhashmask = 0;
Index: xfs/fs/xfs/xfs_itable.c
===================================================================
--- xfs.orig/fs/xfs/xfs_itable.c	2011-03-29 21:55:12.851725366 +0200
+++ xfs/fs/xfs/xfs_itable.c	2011-03-29 21:55:31.660724287 +0200
@@ -259,10 +259,8 @@ xfs_bulkstat(
 		(XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog);
 	nimask = ~(nicluster - 1);
 	nbcluster = nicluster >> mp->m_sb.sb_inopblog;
-	irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4);
-	if (!irbuf)
-		return ENOMEM;
-
+	irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4,
+				   KM_SLEEP | KM_MAYFAIL | KM_LARGE);
 	nirbuf = irbsize / sizeof(*irbuf);
 
 	/*
@@ -527,7 +525,7 @@ xfs_bulkstat(
 	/*
 	 * Done, we're either out of filesystem or space to put the data.
 	 */
-	kmem_free_large(irbuf);
+	kmem_free(irbuf);
 	*ubcountp = ubelem;
 	/*
 	 * Found some inodes, return them now and return the error next time.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-29 20:02                   ` 'Christoph Hellwig'
@ 2011-03-29 20:23                     ` Sean Noonan
  2011-03-29 22:42                     ` Dave Chinner
  1 sibling, 0 replies; 27+ messages in thread
From: Sean Noonan @ 2011-03-29 20:23 UTC (permalink / raw)
  To: 'Christoph Hellwig'
  Cc: 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

> mind either, so to narrow this down does reverting the patch on
> 2.6.38 also fix it?  The revert isn't quite trivial due to changes
> since then, so here's the patch I came up with:

This patch does fix the problem.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-29 20:02                   ` 'Christoph Hellwig'
  2011-03-29 20:23                     ` Sean Noonan
@ 2011-03-29 22:42                     ` Dave Chinner
  2011-03-29 22:45                       ` Sean Noonan
  2011-03-30  9:23                       ` 'Christoph Hellwig'
  1 sibling, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-29 22:42 UTC (permalink / raw)
  To: 'Christoph Hellwig'
  Cc: Sean Noonan, Trammell Hudson, Christos Zoulas, Martin Bligh,
	'linux-kernel@vger.kernel.org',
	Stephen Degler, 'linux-mm@kvack.org',
	'linux-xfs@oss.sgi.com', 'Michel Lespinasse'

On Tue, Mar 29, 2011 at 04:02:56PM -0400, 'Christoph Hellwig' wrote:
> On Tue, Mar 29, 2011 at 03:46:21PM -0400, Sean Noonan wrote:
> > > Can you check if the brute force patch below helps?
> > 
> > No such luck.
> 
> Actually thinking about it - we never do the vmalloc under any fs lock,
> so this can't be the reason.  But nothing else in the patch springs to
> mind either, so to narrow this down does reverting the patch on
> 2.6.38 also fix it?  The revert isn't quite trivial due to changes
> since then, so here's the patch I came up with:
> 
> 
> Index: xfs/fs/xfs/linux-2.6/kmem.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/kmem.c	2011-03-29 21:55:12.871726512 +0200
> +++ xfs/fs/xfs/linux-2.6/kmem.c	2011-03-29 21:55:31.648723706 +0200
> @@ -16,6 +16,7 @@
>   * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>   */
>  #include <linux/mm.h>
> +#include <linux/vmalloc.h>
>  #include <linux/highmem.h>
>  #include <linux/slab.h>
>  #include <linux/swap.h>
> @@ -25,25 +26,8 @@
>  #include "kmem.h"
>  #include "xfs_message.h"
>  
> -/*
> - * Greedy allocation.  May fail and may return vmalloced memory.
> - *
> - * Must be freed using kmem_free_large.
> - */
> -void *
> -kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> -{
> -	void		*ptr;
> -	size_t		kmsize = maxsize;
> -
> -	while (!(ptr = kmem_zalloc_large(kmsize))) {
> -		if ((kmsize >>= 1) <= minsize)
> -			kmsize = minsize;
> -	}
> -	if (ptr)
> -		*size = kmsize;
> -	return ptr;
> -}
> +#define MAX_VMALLOCS	6
> +#define MAX_SLAB_SIZE	0x20000

Why those values for the magic numbers?

....

> Index: xfs/fs/xfs/quota/xfs_qm.c
> ===================================================================
> --- xfs.orig/fs/xfs/quota/xfs_qm.c	2011-03-29 21:55:12.859726589 +0200
> +++ xfs/fs/xfs/quota/xfs_qm.c	2011-03-29 21:55:41.387278609 +0200
> @@ -110,11 +110,12 @@ xfs_Gqm_init(void)
>  	 */
>  	udqhash = kmem_zalloc_greedy(&hsize,
>  				     XFS_QM_HASHSIZE_LOW * sizeof(xfs_dqhash_t),
> -				     XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t));
> +				     XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t),
> +				     KM_SLEEP | KM_MAYFAIL | KM_LARGE);
>  	if (!udqhash)
>  		goto out;
>  
> -	gdqhash = kmem_zalloc_large(hsize);
> +	gdqhash = kmem_zalloc(hsize, KM_SLEEP | KM_LARGE);

Needs a KM_MAYFAIL as well?

>  	if (!gdqhash)
>  		goto out_free_udqhash;
>  
....
> Index: xfs/fs/xfs/xfs_itable.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_itable.c	2011-03-29 21:55:12.851725366 +0200
> +++ xfs/fs/xfs/xfs_itable.c	2011-03-29 21:55:31.660724287 +0200
> @@ -259,10 +259,8 @@ xfs_bulkstat(
>  		(XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog);
>  	nimask = ~(nicluster - 1);
>  	nbcluster = nicluster >> mp->m_sb.sb_inopblog;
> -	irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4);
> -	if (!irbuf)
> -		return ENOMEM;
> -
> +	irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4,
> +				   KM_SLEEP | KM_MAYFAIL | KM_LARGE);
>  	nirbuf = irbsize / sizeof(*irbuf);

Need to keep the if (!irbuf) check as KM_MAYFAIL is passed.

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-29 22:42                     ` Dave Chinner
@ 2011-03-29 22:45                       ` Sean Noonan
  2011-03-30  9:23                       ` 'Christoph Hellwig'
  1 sibling, 0 replies; 27+ messages in thread
From: Sean Noonan @ 2011-03-29 22:45 UTC (permalink / raw)
  To: 'Dave Chinner', 'Christoph Hellwig'
  Cc: Trammell Hudson, Christos Zoulas, Martin Bligh,
	'linux-kernel@vger.kernel.org',
	Stephen Degler, 'linux-mm@kvack.org',
	'linux-xfs@oss.sgi.com', 'Michel Lespinasse'

> Need to keep the if (!irbuf) check as KM_MAYFAIL is passed.

It wasn't there before the bug presented itself, so leaving it in wouldn't be a true test of whether the bug has been tracked to the correct place.  I'll test again with the if (!irbuf) check.

Sean

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-29 19:54                 ` Sean Noonan
@ 2011-03-30  0:09                   ` Dave Chinner
  2011-03-30  1:32                     ` Sean Noonan
  2011-03-30  9:30                     ` 'Christoph Hellwig'
  0 siblings, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-30  0:09 UTC (permalink / raw)
  To: Sean Noonan
  Cc: 'Christoph Hellwig', 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

On Tue, Mar 29, 2011 at 03:54:12PM -0400, Sean Noonan wrote:
> > Can you check if the brute force patch below helps?  
> 
> Not sure if this helps at all, but here is the stack from all three processes involved.  This is without MAP_POPULATE and with the patch you just sent.
> 
> # ps aux | grep 'D[+]*[[:space:]]'
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root      2314  0.2  0.0      0     0 ?        D    19:44   0:00 [flush-8:0]
> root      2402  0.0  0.0      0     0 ?        D    19:44   0:00 [xfssyncd/sda9]
> root      3861  2.6  9.9 16785280 4912848 pts/0 D+  19:45   0:07 ./vmtest /xfs/hugefile.dat 17179869184
> 
> # for p in 2314 2402 3861; do echo $p; cat /proc/$p/stack; done
> 2314
> [<ffffffff810d634a>] congestion_wait+0x7a/0x130
> [<ffffffff8129721c>] kmem_alloc+0x6c/0xf0
> [<ffffffff8127c07e>] xfs_inode_item_format+0x36e/0x3b0
> [<ffffffff8128401f>] xfs_log_commit_cil+0x4f/0x3b0
> [<ffffffff8128ff31>] _xfs_trans_commit+0x1f1/0x2b0
> [<ffffffff8127c716>] xfs_iomap_write_allocate+0x1a6/0x340
> [<ffffffff81298883>] xfs_map_blocks+0x193/0x2c0
> [<ffffffff812992fa>] xfs_vm_writepage+0x1ca/0x520
> [<ffffffff810c4bd2>] __writepage+0x12/0x40
> [<ffffffff810c53dd>] write_cache_pages+0x1dd/0x4f0
> [<ffffffff810c573c>] generic_writepages+0x4c/0x70
> [<ffffffff812986b8>] xfs_vm_writepages+0x58/0x70
> [<ffffffff810c577c>] do_writepages+0x1c/0x40
> [<ffffffff811247d1>] writeback_single_inode+0xf1/0x240
> [<ffffffff81124edd>] writeback_sb_inodes+0xdd/0x1b0
> [<ffffffff81125966>] writeback_inodes_wb+0x76/0x160
> [<ffffffff81125d93>] wb_writeback+0x343/0x550
> [<ffffffff81126126>] wb_do_writeback+0x186/0x2e0
> [<ffffffff81126342>] bdi_writeback_thread+0xc2/0x310
> [<ffffffff81067846>] kthread+0x96/0xa0
> [<ffffffff8165a414>] kernel_thread_helper+0x4/0x10
> [<ffffffffffffffff>] 0xffffffffffffffff

So, it's trying to allocate a buffer for the inode extent list,
which should only be a couple of hundred bytes, and at most ~2kB if
you are using large inodes. That still shouldn't be hitting memory
allocation problems with 44GB of free RAM....

Hmmmm. I wonder - the process is doing a random walk of 16GB, so
it's probably created tens of thousands of delayed allocation
extents before any real allocation was done. xfs_inode_item_format()
uses the in-core data fork size for the extent buffer allocation
which in this case would be much larger than what can possibly fit
inside the inode data fork.

Let's see: worst case is 8GB of sparse blocks, which is 2^21
delalloc blocks, which gives a worst case allocation size of 2^21 *
sizeof(struct xfs_bmbt_rec), which is roughly 64MB and would
overflow the return value. Even at 1k delalloc extents, we'll be
asking for an order-15 allocation when all we really need is an
order-0 allocation.
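The back-of-envelope numbers above can be checked with a short sketch. The 4 KiB block size and the 16-byte size of struct xfs_bmbt_rec (two 64-bit words) are assumptions for illustration; the point is the order of magnitude, not the exact figure:

```python
BLOCK_SIZE = 4096          # assumed filesystem block size, bytes
BMBT_REC_SIZE = 16         # struct xfs_bmbt_rec: two 64-bit words

def worst_case_buffer(sparse_bytes: int) -> int:
    """Size of the extent buffer kmem_alloc() is asked for when every
    sparse block has become its own delayed-allocation extent."""
    return (sparse_bytes // BLOCK_SIZE) * BMBT_REC_SIZE

# 8 GiB of sparse blocks -> 2**21 delalloc extents -> a 32 MiB request;
# a fully sparse 16 GiB mapping would double that to 64 MiB. Either way
# it is tens of MB for what should be an order-0 (single page) allocation.
print(worst_case_buffer(8 * 2**30) // 2**20,
      worst_case_buffer(16 * 2**30) // 2**20)
```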

Ok, so that looks like the root cause of the problem. Can you try the
patch below to see if it fixes the problem (without any other
patches applied or reverted).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: fix extent format buffer allocation size

From: Dave Chinner <dchinner@redhat.com>

When formatting an inode item, we have to allocate a separate buffer
to hold extents when there are delayed allocation extents on the
inode and it is in extent format. The allocation size is derived
from the in-core data fork representation, which accounts for
delayed allocation extents, while the on-disk representation does
not contain any delalloc extents.

As a result of this mismatch, the allocated buffer can be far larger
than needed to hold the real extent list which, due to the fact the
inode is in extent format, is limited to the size of the literal
area of the inode. However, we can have thousands of delalloc
extents, resulting in an allocation size orders of magnitude larger
than is needed to hold all the real extents.

Fix this by limiting the size of the buffer being allocated to the
size of the literal area of the inodes in the filesystem (i.e. the
maximum size an inode fork can grow to).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode_item.c |   69 ++++++++++++++++++++++++++++------------------
 1 files changed, 42 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 46cc401..12cdc39 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -198,6 +198,43 @@ xfs_inode_item_size(
 }
 
 /*
+ * xfs_inode_item_format_extents - convert in-core extents to on-disk form
+ *
+ * For either the data or attr fork in extent format, we need to endian convert
+ * the in-core extents as we place them into the on-disk inode. In this case, we
+ * need to do this conversion before we write the extents into the log. Because
+ * we don't have the disk inode to write into here, we allocate a buffer and
+ * format the extents into it via xfs_iextents_copy(). We free the buffer in
+ * the unlock routine after the copy for the log has been made.
+ *
+ * For the data fork, there can be delayed allocation extents
+ * in the inode as well, so the in-core data fork can be much larger than the
+ * on-disk data representation of real inodes. Hence we need to limit the size
+ * of the allocation to what will fit in the inode fork, otherwise we could be
+ * asking for excessively large allocation sizes.
+ */
+STATIC void
+xfs_inode_item_format_extents(
+	struct xfs_inode	*ip,
+	struct xfs_log_iovec	*vecp,
+	int			whichfork,
+	int			type)
+{
+	xfs_bmbt_rec_t		*ext_buffer;
+
+	ext_buffer = kmem_alloc(XFS_IFORK_SIZE(ip, whichfork),
+							KM_SLEEP | KM_NOFS);
+	if (whichfork == XFS_DATA_FORK)
+		ip->i_itemp->ili_extents_buf = ext_buffer;
+	else
+		ip->i_itemp->ili_aextents_buf = ext_buffer;
+
+	vecp->i_addr = ext_buffer;
+	vecp->i_len = xfs_iextents_copy(ip, ext_buffer, whichfork);
+	vecp->i_type = type;
+}
+
+/*
  * This is called to fill in the vector of log iovecs for the
  * given inode log item.  It fills the first item with an inode
  * log format structure, the second with the on-disk inode structure,
@@ -213,7 +250,6 @@ xfs_inode_item_format(
 	struct xfs_inode	*ip = iip->ili_inode;
 	uint			nvecs;
 	size_t			data_bytes;
-	xfs_bmbt_rec_t		*ext_buffer;
 	xfs_mount_t		*mp;
 
 	vecp->i_addr = &iip->ili_format;
@@ -320,22 +356,8 @@ xfs_inode_item_format(
 			} else
 #endif
 			{
-				/*
-				 * There are delayed allocation extents
-				 * in the inode, or we need to convert
-				 * the extents to on disk format.
-				 * Use xfs_iextents_copy()
-				 * to copy only the real extents into
-				 * a separate buffer.  We'll free the
-				 * buffer in the unlock routine.
-				 */
-				ext_buffer = kmem_alloc(ip->i_df.if_bytes,
-					KM_SLEEP);
-				iip->ili_extents_buf = ext_buffer;
-				vecp->i_addr = ext_buffer;
-				vecp->i_len = xfs_iextents_copy(ip, ext_buffer,
-						XFS_DATA_FORK);
-				vecp->i_type = XLOG_REG_TYPE_IEXT;
+				xfs_inode_item_format_extents(ip, vecp,
+					XFS_DATA_FORK, XLOG_REG_TYPE_IEXT);
 			}
 			ASSERT(vecp->i_len <= ip->i_df.if_bytes);
 			iip->ili_format.ilf_dsize = vecp->i_len;
@@ -445,19 +467,12 @@ xfs_inode_item_format(
 			 */
 			vecp->i_addr = ip->i_afp->if_u1.if_extents;
 			vecp->i_len = ip->i_afp->if_bytes;
+			vecp->i_type = XLOG_REG_TYPE_IATTR_EXT;
 #else
 			ASSERT(iip->ili_aextents_buf == NULL);
-			/*
-			 * Need to endian flip before logging
-			 */
-			ext_buffer = kmem_alloc(ip->i_afp->if_bytes,
-				KM_SLEEP);
-			iip->ili_aextents_buf = ext_buffer;
-			vecp->i_addr = ext_buffer;
-			vecp->i_len = xfs_iextents_copy(ip, ext_buffer,
-					XFS_ATTR_FORK);
+			xfs_inode_item_format_extents(ip, vecp,
+					XFS_ATTR_FORK, XLOG_REG_TYPE_IATTR_EXT);
 #endif
-			vecp->i_type = XLOG_REG_TYPE_IATTR_EXT;
 			iip->ili_format.ilf_asize = vecp->i_len;
 			vecp++;
 			nvecs++;

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-30  0:09                   ` Dave Chinner
@ 2011-03-30  1:32                     ` Sean Noonan
  2011-03-30  1:44                       ` Dave Chinner
  2011-03-30  9:30                     ` 'Christoph Hellwig'
  1 sibling, 1 reply; 27+ messages in thread
From: Sean Noonan @ 2011-03-30  1:32 UTC (permalink / raw)
  To: 'Dave Chinner'
  Cc: 'Christoph Hellwig', 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

> Ok, so that looks like root cause of the problem. can you try the
> patch below to see if it fixes the problem (without any other
> patches applied or reverted).

It looks like this does fix the deadlock problem.  However, it appears to come at the price of significantly higher mmap startup costs.  

# ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 ))
/xfs/d-1/hugefile.dat: mapped 17179869184 bytes in 324387362198 ticks

Sean


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-30  1:32                     ` Sean Noonan
@ 2011-03-30  1:44                       ` Dave Chinner
  2011-03-30  1:52                         ` Sean Noonan
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2011-03-30  1:44 UTC (permalink / raw)
  To: Sean Noonan
  Cc: 'Christoph Hellwig', 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

On Tue, Mar 29, 2011 at 09:32:06PM -0400, Sean Noonan wrote:
> > Ok, so that looks like root cause of the problem. can you try the
> > patch below to see if it fixes the problem (without any other
> > patches applied or reverted).
> 
> It looks like this does fix the deadlock problem.  However, it
> appears to come at the price of significantly higher mmap startup
> costs.

It shouldn't make any difference to startup costs: the current
code uses read faults to populate the region, and those don't cause
any allocation to occur, hence this code is not executed during
the populate phase.

Is this repeatable or is it just a one-off result?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: XFS memory allocation deadlock in 2.6.38
  2011-03-30  1:44                       ` Dave Chinner
@ 2011-03-30  1:52                         ` Sean Noonan
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Noonan @ 2011-03-30  1:52 UTC (permalink / raw)
  To: 'Dave Chinner'
  Cc: 'Christoph Hellwig', 'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

> Is this repeatable or is it just a one-off result?

It was repeated three times before I sent the email, but I can't reproduce it again now.  Call it a fluke.

Sean


* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-29 22:42                     ` Dave Chinner
  2011-03-29 22:45                       ` Sean Noonan
@ 2011-03-30  9:23                       ` 'Christoph Hellwig'
  1 sibling, 0 replies; 27+ messages in thread
From: 'Christoph Hellwig' @ 2011-03-30  9:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: 'Christoph Hellwig',
	Sean Noonan, Trammell Hudson, Christos Zoulas, Martin Bligh,
	'linux-kernel@vger.kernel.org',
	Stephen Degler, 'linux-mm@kvack.org',
	'linux-xfs@oss.sgi.com', 'Michel Lespinasse'

On Wed, Mar 30, 2011 at 09:42:30AM +1100, Dave Chinner wrote:
> > +#define MAX_VMALLOCS	6
> > +#define MAX_SLAB_SIZE	0x20000
> 
> Why those values for the magic numbers?

Ask the person who added it originally; it's just a revert to the
code before my commit to clean up our vmalloc usage.



* Re: XFS memory allocation deadlock in 2.6.38
  2011-03-30  0:09                   ` Dave Chinner
  2011-03-30  1:32                     ` Sean Noonan
@ 2011-03-30  9:30                     ` 'Christoph Hellwig'
  1 sibling, 0 replies; 27+ messages in thread
From: 'Christoph Hellwig' @ 2011-03-30  9:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Sean Noonan, 'Christoph Hellwig',
	'Michel Lespinasse',
	'linux-kernel@vger.kernel.org',
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	'linux-xfs@oss.sgi.com',
	Stephen Degler, 'linux-mm@kvack.org'

On Wed, Mar 30, 2011 at 11:09:42AM +1100, Dave Chinner wrote:
> +	ext_buffer = kmem_alloc(XFS_IFORK_SIZE(ip, whichfork),
> +							KM_SLEEP | KM_NOFS);

The old code didn't use KM_NOFS, and I don't think it needed it either,
as we call the iop_format handlers inside the region covered by the
PF_FSTRANS flag.

Also I think the routine needs to be under #ifndef XFS_NATIVE_HOST, as
we do not use it for big endian builds.
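
[Editorial note: the reason KM_NOFS is redundant inside ->iop_format is that XFS's flag conversion already strips __GFP_FS when the task carries PF_FSTRANS. A sketch of that logic, approximating the 2.6.38-era fs/xfs/linux-2.6/kmem.h -- details from memory, treat as indicative only:]

```c
/*
 * Approximate shape of XFS's kmem flag conversion: PF_FSTRANS on the
 * current task masks __GFP_FS exactly as KM_NOFS does, so allocations
 * made inside a transaction are already NOFS.
 */
static inline gfp_t
kmem_flags_convert(unsigned int __nocast flags)
{
	gfp_t lflags;

	if (flags & KM_NOSLEEP) {
		lflags = GFP_ATOMIC | __GFP_NOWARN;
	} else {
		lflags = GFP_KERNEL | __GFP_NOWARN;
		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
			lflags &= ~__GFP_FS;
	}
	return lflags;
}
```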


end of thread, other threads:[~2011-03-30  9:30 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-21 16:19 XFS memory allocation deadlock in 2.6.38 Sean Noonan
2011-03-23 19:39 ` Sean Noonan
2011-03-24 17:43   ` Christoph Hellwig
2011-03-24 23:45     ` Michel Lespinasse
2011-03-28 14:58       ` Sean Noonan
2011-03-28 21:06         ` Michel Lespinasse
2011-03-28 21:34           ` Sean Noonan
2011-03-29  0:25             ` Michel Lespinasse
2011-03-29  1:51             ` Dave Chinner
2011-03-29  2:49               ` Sean Noonan
2011-03-29 19:05             ` Sean Noonan
2011-03-29 19:24               ` 'Christoph Hellwig'
2011-03-29 19:39                 ` Johannes Weiner
2011-03-29 19:43                   ` 'Christoph Hellwig'
2011-03-29 19:46                 ` Sean Noonan
2011-03-29 20:02                   ` 'Christoph Hellwig'
2011-03-29 20:23                     ` Sean Noonan
2011-03-29 22:42                     ` Dave Chinner
2011-03-29 22:45                       ` Sean Noonan
2011-03-30  9:23                       ` 'Christoph Hellwig'
2011-03-29 19:54                 ` Sean Noonan
2011-03-30  0:09                   ` Dave Chinner
2011-03-30  1:32                     ` Sean Noonan
2011-03-30  1:44                       ` Dave Chinner
2011-03-30  1:52                         ` Sean Noonan
2011-03-30  9:30                     ` 'Christoph Hellwig'
2011-03-27 18:11 ` Maciej Rutecki
