* xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
@ 2015-06-23  8:18 Alex Lyakas
  2015-06-23 20:18 ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2015-06-23  8:18 UTC (permalink / raw)
  To: xfs, hch



Greetings,

We are hitting an issue with XFS printing messages like
“XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)”
and a stack trace like in [1]. Eventually, a hung-task panic kicks in with stack traces like [2].

We are running kernel 3.8.13. I see that a similar issue was discussed in http://oss.sgi.com/archives/xfs/2012-01/msg00341.html, but no code changes followed compared to what we have in 3.8.13.

Any suggestion on how to move forward with this problem? For example, does this memory really have to be allocated with kmalloc (i.e., physically contiguous), or can vmalloc be used?
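
If I read the 3.8 code correctly, the message comes from XFS's kmem_alloc() retry loop, which never gives up unless the caller allows failure. A rough sketch (paraphrased from fs/xfs/kmem.c, not a verbatim copy) is:

    void *
    kmem_alloc(size_t size, xfs_km_flags_t flags)
    {
            int     retries = 0;
            gfp_t   lflags = kmem_flags_convert(flags);
            void    *ptr;

            do {
                    ptr = kmalloc(size, lflags);
                    if (ptr || (flags & (KM_MAYFAIL | KM_NOSLEEP)))
                            return ptr;
                    if (!(++retries % 100))
                            xfs_err(NULL,
                "possible memory allocation deadlock in %s (mode:0x%x)",
                                        __func__, lflags);
                    congestion_wait(BLK_RW_ASYNC, HZ / 50);
            } while (1);
    }

So the message itself is a warning that a kmalloc (mode 0x250 appears to be GFP_NOFS | __GFP_NOWARN) keeps failing and being retried; the hung tasks in [2] look like a consequence of that loop spinning while locks are held.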

Thanks,
Alex.

[1]
[109626.075483] nfsd            D 0000000000000002     0 20042      2 0x00000000
[109626.075483]  ffff88026ac3ef58 0000000000000046 ffff88031fffbd80 ffff88026ac40000
[109626.075483]  ffff88026ac3ffd8 ffff88026ac3ffd8 ffff88026ac3ffd8 0000000000013f40
[109626.075483]  ffff88030e542e80 ffff88026ac40000 ffff88030e58c000 ffff88026ac3ef90
[109626.075483] Call Trace:
[109626.075483]  [<ffffffff816ec509>] schedule+0x29/0x70
[109626.075483]  [<ffffffff816eabd0>] schedule_timeout+0x130/0x250
[109626.075483]  [<ffffffff8106a340>] ? cascade+0xa0/0xa0
[109626.075483]  [<ffffffff816ec8a2>] io_schedule_timeout+0xa2/0x100
[109626.075483]  [<ffffffff81185311>] ? __kmalloc+0x181/0x190
[109626.075483]  [<ffffffff81153bc0>] congestion_wait+0x80/0x120
[109626.075483]  [<ffffffff81185311>] ? __kmalloc+0x181/0x190
[109626.075483]  [<ffffffff8107fc10>] ? add_wait_queue+0x60/0x60
[109626.075483]  [<ffffffff81185260>] ? __kmalloc+0xd0/0x190
[109626.075483]  [<ffffffffa07631fc>] ? kmem_alloc+0x5c/0xe0 [xfs]
[109626.075483]  [<ffffffffa0763330>] ? kmem_realloc+0x30/0x70 [xfs]
[109626.075483]  [<ffffffffa0795f0d>] ? xfs_iext_realloc_indirect+0x3d/0x60 [xfs]
[109626.075483]  [<ffffffffa0795f6f>] ? xfs_iext_irec_new+0x3f/0x180 [xfs]
[109626.075483]  [<ffffffffa0796229>] ? xfs_iext_add_indirect_multi+0x179/0x2b0 [xfs]
[109626.075483]  [<ffffffffa079662e>] ? xfs_iext_add+0xce/0x290 [xfs]
[109626.075483]  [<ffffffff81097c33>] ? update_curr+0x143/0x1f0
[109626.075483]  [<ffffffffa0796842>] ? xfs_iext_insert+0x52/0x100 [xfs]
[109626.075483]  [<ffffffffa0771b43>] ? xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
[109626.075483]  [<ffffffffa0771b43>] ? xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
[109626.075483]  [<ffffffffa07950d7>] ? xfs_iext_bno_to_ext+0xf7/0x160 [xfs]
[109626.075483]  [<ffffffffa0772389>] ? xfs_bmapi_reserve_delalloc+0x279/0x2a0 [xfs]
[109626.075483]  [<ffffffffa07793b2>] ? xfs_bmapi_delay+0x122/0x270 [xfs]
[109626.075483]  [<ffffffffa0758703>] ? xfs_iomap_write_delay+0x173/0x320 [xfs]
[109626.075483]  [<ffffffffa077909c>] ? xfs_bmapi_read+0xfc/0x2f0 [xfs]
[109626.075483]  [<ffffffff8135d8f3>] ? call_rwsem_down_write_failed+0x13/0x20
[109626.075483]  [<ffffffffa0745b40>] ? __xfs_get_blocks+0x280/0x550 [xfs]
[109626.075483]  [<ffffffffa0745e41>] ? xfs_get_blocks+0x11/0x20 [xfs]
[109626.075483]  [<ffffffff811cf77e>] ? __block_write_begin+0x1ae/0x4e0
[109626.075483]  [<ffffffffa0745e30>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
[109626.075483]  [<ffffffff81135fff>] ? grab_cache_page_write_begin+0x8f/0xf0
[109626.075483]  [<ffffffffa074509f>] ? xfs_vm_write_begin+0x5f/0xe0 [xfs]
[109626.075483]  [<ffffffff8113552a>] ? generic_perform_write+0xca/0x210
[109626.075483]  [<ffffffff811356cd>] ? generic_file_buffered_write+0x5d/0x90
[109626.075483]  [<ffffffffa07502d5>] ? xfs_file_buffered_aio_write+0x115/0x1c0 [xfs]
[109626.075483]  [<ffffffff816159f4>] ? ip_finish_output+0x224/0x3b0
[109626.075483]  [<ffffffffa075047c>] ? xfs_file_aio_write+0xfc/0x1b0 [xfs]
[109626.075483]  [<ffffffffa0750380>] ? xfs_file_buffered_aio_write+0x1c0/0x1c0 [xfs]
[109626.075483]  [<ffffffff8119b8c3>] ? do_sync_readv_writev+0xa3/0xe0
[109626.075483]  [<ffffffff8119bb8d>] ? do_readv_writev+0xcd/0x1d0
[109626.075483]  [<ffffffff810877e0>] ? set_groups+0x40/0x60
[109626.075483]  [<ffffffffa01be6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
[109626.075483]  [<ffffffff8119bccc>] ? vfs_writev+0x3c/0x50
[109626.075483]  [<ffffffffa01b7dd2>] ? nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
[109626.075483]  [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
[109626.075483]  [<ffffffffa01ba679>] ? nfsd_write+0xf9/0x110 [nfsd]
[109626.075483]  [<ffffffffa01c4dd1>] ? nfsd3_proc_write+0xb1/0x140 [nfsd]
[109626.075483]  [<ffffffffa01b3d62>] ? nfsd_dispatch+0x102/0x270 [nfsd]
[109626.075483]  [<ffffffffa012bb48>] ? svc_process_common+0x328/0x5e0 [sunrpc]
[109626.075483]  [<ffffffffa012c153>] ? svc_process+0x103/0x160 [sunrpc]
[109626.075483]  [<ffffffffa01b372f>] ? nfsd+0xbf/0x130 [nfsd]
[109626.075483]  [<ffffffffa01b3670>] ? nfsd_destroy+0x80/0x80 [nfsd]
[109626.075483]  [<ffffffff8107f050>] ? kthread+0xc0/0xd0
[109626.075483]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
[109626.075483]  [<ffffffff816f61ec>] ? ret_from_fork+0x7c/0xb0
[109626.075483]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0

[2]
[87303.976119] INFO: task nfsd:5684 blocked for more than 180 seconds.
[87303.976976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87303.978012] nfsd            D 0000000000000003     0  5684      2 0x00000000
[87303.978017]  ffff8802506d37e8 0000000000000046 ffff880200000000 ffff880307ce3c9c
[87303.978020]  ffff8802506d3fd8 ffff8802506d3fd8 ffff8802506d3fd8 0000000000013f40
[87303.978023]  ffff88030e5445c0 ffff8802ca9945c0 ffff8802506d37c8 ffff88009dd8fd60
[87303.978026] Call Trace:
[87303.978036]  [<ffffffff816ec509>] schedule+0x29/0x70
[87303.978039]  [<ffffffff816ec7be>] schedule_preempt_disabled+0xe/0x10
[87303.978042]  [<ffffffff816eb437>] __mutex_lock_slowpath+0xd7/0x150
[87303.978045]  [<ffffffff816eb04a>] mutex_lock+0x2a/0x50
[87303.978076]  [<ffffffffa075a227>] xfs_file_buffered_aio_write+0x67/0x1c0 [xfs]
[87303.978089]  [<ffffffffa075a47c>] xfs_file_aio_write+0xfc/0x1b0 [xfs]
[87303.978101]  [<ffffffffa075a380>] ? xfs_file_buffered_aio_write+0x1c0/0x1c0 [xfs]
[87303.978105]  [<ffffffff8119b8c3>] do_sync_readv_writev+0xa3/0xe0
[87303.978109]  [<ffffffff8119bb8d>] do_readv_writev+0xcd/0x1d0
[87303.978112]  [<ffffffff811afe51>] ? prepend_path+0xf1/0x1e0
[87303.978115]  [<ffffffff811856fc>] ? kmem_cache_alloc_trace+0x11c/0x140
[87303.978119]  [<ffffffff8130c425>] ? aa_alloc_task_context+0x35/0x50
[87303.978122]  [<ffffffff8119bccc>] vfs_writev+0x3c/0x50
[87303.978145]  [<ffffffffa0266dd2>] nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
[87303.978149]  [<ffffffff816ed43e>] ? _raw_spin_lock+0xe/0x20
[87303.978159]  [<ffffffffa0284ba4>] ? find_confirmed_client.isra.58+0x144/0x1a0 [nfsd]
[87303.978167]  [<ffffffffa0284d48>] ? nfsd4_lookup_stateid+0xc8/0x120 [nfsd]
[87303.978174]  [<ffffffffa0269623>] nfsd_write+0xa3/0x110 [nfsd]
[87303.978182]  [<ffffffffa027794c>] nfsd4_write+0x1cc/0x250 [nfsd]
[87303.978189]  [<ffffffffa027746c>] nfsd4_proc_compound+0x5ac/0x7a0 [nfsd]
[87303.978197]  [<ffffffffa0262d62>] nfsd_dispatch+0x102/0x270 [nfsd]
[87303.978214]  [<ffffffffa01f3b48>] svc_process_common+0x328/0x5e0 [sunrpc]
[87303.978225]  [<ffffffffa01f4153>] svc_process+0x103/0x160 [sunrpc]
[87303.978232]  [<ffffffffa026272f>] nfsd+0xbf/0x130 [nfsd]
[87303.978238]  [<ffffffffa0262670>] ? nfsd_destroy+0x80/0x80 [nfsd]
[87303.978243]  [<ffffffff8107f050>] kthread+0xc0/0xd0
[87303.978246]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
[87303.978250]  [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
[87303.978253]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0


* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-23  8:18 xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock" Alex Lyakas
@ 2015-06-23 20:18 ` Dave Chinner
  2015-06-27 21:01   ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-06-23 20:18 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: hch, xfs

On Tue, Jun 23, 2015 at 10:18:21AM +0200, Alex Lyakas wrote:
> Greetings,
> 
> We are hitting an issue with XFS printing messages like “XFS:
> possible memory allocation deadlock in kmem_alloc
> (mode:0x250)” and stack trace like in [1]. Eventually,
> hung-task panic kicks in with stack traces like [2].
> 
> We are running kernel 3.8.13. I see that in
> http://oss.sgi.com/archives/xfs/2012-01/msg00341.html a similar
> issue has been discussed, but no code changes followed comparing
> to what we have in 3.8.13.
> 
> Any suggestion on how to move forward with this problem? For
> example, does this memory has to be really allocated with kmalloc
> (i.e., physically continuous) or vmalloc can be used?

We left it alone because it is relatively rare for people to hit it,
and generally it indicates a severe fragmentation problem when they
do hit it (i.e. a file with millions of extents in it). Can you
track down the file that this is occurring against and see how many
extents it has?
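
For example (the path here is a placeholder), something like:

    $ xfs_bmap -v /path/to/suspect/file | wc -l

gives a rough extent count (the first couple of lines of output are headers).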

i.e. you may be much better off by taking measures to avoid excessive
fragmentation than removing the canary from the mine...

> [109626.075483] nfsd            D 0000000000000002     0 20042      2 0x00000000

Hmmm - it's also a file written by the NFS server - is this on a
dedicated NFS server?

> [109626.075483]  [<ffffffffa01be6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
> [109626.075483]  [<ffffffff8119bccc>] ? vfs_writev+0x3c/0x50
> [109626.075483]  [<ffffffffa01b7dd2>] ? nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
> [109626.075483]  [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
> [109626.075483]  [<ffffffffa01ba679>] ? nfsd_write+0xf9/0x110 [nfsd]
> [109626.075483]  [<ffffffffa01c4dd1>] ? nfsd3_proc_write+0xb1/0x140 [nfsd]

Interesting that this is an NFSv3 write...

> [87303.976119] INFO: task nfsd:5684 blocked for more than 180 seconds.
> [87303.976976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [87303.978012] nfsd            D 0000000000000003     0  5684      2 0x00000000
....
> [87303.978174]  [<ffffffffa0269623>] nfsd_write+0xa3/0x110 [nfsd]
> [87303.978182]  [<ffffffffa027794c>] nfsd4_write+0x1cc/0x250 [nfsd]
> [87303.978189]  [<ffffffffa027746c>] nfsd4_proc_compound+0x5ac/0x7a0 [nfsd]

And that is an NFSv4 write. You have multiple clients writing to the
same file using different versions of the NFS protocol?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-23 20:18 ` Dave Chinner
@ 2015-06-27 21:01   ` Alex Lyakas
  2015-06-28 18:19     ` Alex Lyakas
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Alex Lyakas @ 2015-06-27 21:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Danny Shavit, Shyam Kaushik, Yair Hershko, hch, xfs

Hi Dave,

First, to answer your questions:
- this filesystem is accessed only through NFS, so I guess we can say this 
is a "dedicated" NFS server
- yes, both NFSv3 and NFSv4 are used (by different servers), and definitely 
they may attempt to access the same file

I did some further debugging by patching XFS to get some more info on how
much memory is requested, and who requests it. I added a kmem_realloc
variant that is called only by xfs_iext_realloc_indirect. This new function
(kmem_realloc_xfs_iext_realloc_indirect) also prints the number of the inode
that needs the memory. This is after some fiddling to check that "ifp" is the
one embedded in xfs_inode and not the one allocated through kmem_cache (attr
fork). After memory allocation fails for the first time, I also call
handle_sysrq('m'). Finally, I print how many retries it took for the
allocation to succeed. I also added some info to the normal kmem_alloc function.
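
The alloc side of the wrapper ends up looking roughly like this (a simplified sketch, not the exact patch; it mirrors the stock kmem_alloc() retry loop with the extra logging described above, and the realloc variant just calls it and copies the old contents, like the stock kmem_realloc):

    void *
    kmem_alloc_xfs_iext_realloc_indirect(size_t size, xfs_km_flags_t flags,
                                         xfs_ino_t inum)
    {
            gfp_t   lflags = kmem_flags_convert(flags);
            int     retries = 0;
            void    *ptr;

            do {
                    ptr = kmalloc(size, lflags);
                    if (ptr)
                            break;
                    if (retries++ == 0) {
                            xfs_err(NULL,
            "pid=%d kmem_alloc failure inum=%llu size=%zu flags=0x%x lflags=0x%x",
                                    current->pid, (unsigned long long)inum,
                                    size, (unsigned int)flags,
                                    (unsigned int)lflags);
                            handle_sysrq('m');      /* dump Mem-Info once */
                    }
                    congestion_wait(BLK_RW_ASYNC, HZ / 50);
            } while (1);

            if (retries)
                    xfs_err(NULL,
            "pid=%d kmem_alloc success after %d retries inum=%llu size=%zu",
                            current->pid, retries,
                            (unsigned long long)inum, size);
            return ptr;
    }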

The results are the following:
- memory allocation failures happened only on the
kmem_realloc_xfs_iext_realloc_indirect path so far
- XFS hits memory re-allocation failures when it needs to allocate about
35KB. Sometimes the allocation succeeds after a few retries, but sometimes it
takes several thousand retries.
- All allocation failures happened on NFSv3 paths
- Three inode numbers were reported as failing memory allocations. After
several hours, "find -inum" is still searching for these inodes... this is a
huge filesystem... Is there any quicker (XFS-specific?) way to find
the file based on an inode number?
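
One thing I am considering, rather than waiting for find: xfs_db can at least dump the inode by number, read-only, directly from the block device (the device name below is a placeholder), even though it does not give the pathname:

    # xfs_db -r /dev/<device>
    xfs_db> inode 35646862875
    xfs_db> p core.nextents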

Please see a typical allocation failure in [1].

Any recommendation on how to move forward with this issue?

An additional observation from my local system: when writing files to XFS
locally vs writing the same files via NFS (both v3 and v4), the number of
extents reported by "xfs_bmap" is much higher in the NFS case. For example,
creating a new file and writing into it as follows:
- write 4KB
- skip 4KB (i.e., lseek to 4KB + 4KB)
- write 4KB
- skip 4KB
...
Create a file of, say, 50MB this way (a minimal reproducer sketch is below).
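
To be concrete, the workload is essentially the following (a minimal sketch; the file name and target size are arbitrary):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            long long written = 0;
            int fd;

            memset(buf, 'a', sizeof(buf));
            fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* write 4KB, then skip 4KB, until ~50MB of data is written */
            while (written < 50LL * 1024 * 1024) {
                    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                            perror("write");
                            return 1;
                    }
                    if (lseek(fd, 4096, SEEK_CUR) < 0) {
                            perror("lseek");
                            return 1;
                    }
                    written += sizeof(buf);
            }
            close(fd);
            return 0;
    }

Running "xfs_bmap testfile" on the resulting file then gives the extent counts I mention below.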

Locally it ends up with very few (1-5) extents. But the same exact workload
through NFS results in several thousand extents. The filesystem is
mounted as "sync" in both cases. This is currently just an observation,
which I will try to debug further. Are you perhaps familiar with such
behavior? Could you perhaps check whether it also happens in your
environment?

Thanks for your help,
Alex.



[1]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.497033] XFS: pid=12642 
kmem_alloc failure inum=35646862875 size=35584 flags=0x4 lflags=0x250
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498407] Pid: 12642, comm: 
nfsd Tainted: GF          O 3.8.13-030813-generic #201305111843
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498410] Call Trace:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498450] 
[<ffffffffa07a832e>] kmem_alloc_xfs_iext_realloc_indirect+0x16e/0x1f0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498465] 
[<ffffffffa07a86b3>] kmem_realloc_xfs_iext_realloc_indirect+0x33/0x80 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498482] 
[<ffffffffa07db2a0>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498498] 
[<ffffffffa07db2ff>] xfs_iext_irec_new+0x3f/0x180 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498516] 
[<ffffffffa07db5b9>] xfs_iext_add_indirect_multi+0x179/0x2b0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498532] 
[<ffffffffa07db9be>] xfs_iext_add+0xce/0x290 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498539] 
[<ffffffff81076634>] ? wake_up_worker+0x24/0x30
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498553] 
[<ffffffffa07dbbd2>] xfs_iext_insert+0x52/0x100 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498569] 
[<ffffffffa07b6ed3>] ? xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498583] 
[<ffffffffa07b6ed3>] xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498598] 
[<ffffffffa07da3fc>] ? xfs_iext_bno_to_ext+0x8c/0x160 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498614] 
[<ffffffffa07b7719>] xfs_bmapi_reserve_delalloc+0x279/0x2a0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498628] 
[<ffffffffa07be742>] xfs_bmapi_delay+0x122/0x270 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498642] 
[<ffffffffa079d703>] xfs_iomap_write_delay+0x173/0x320 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498656] 
[<ffffffffa07be42c>] ? xfs_bmapi_read+0xfc/0x2f0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498660] 
[<ffffffff8135d8f3>] ? call_rwsem_down_write_failed+0x13/0x20
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498665] 
[<ffffffff81195bbc>] ? lookup_page_cgroup+0x4c/0x50
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498677] 
[<ffffffffa078ab40>] __xfs_get_blocks+0x280/0x550 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498688] 
[<ffffffffa078ae41>] xfs_get_blocks+0x11/0x20 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498695] 
[<ffffffff811cf77e>] __block_write_begin+0x1ae/0x4e0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498705] 
[<ffffffffa078ae30>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498708] 
[<ffffffff81135fff>] ? grab_cache_page_write_begin+0x8f/0xf0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498720] 
[<ffffffffa078a09f>] xfs_vm_write_begin+0x5f/0xe0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498725] 
[<ffffffff8113552a>] generic_perform_write+0xca/0x210
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498729] 
[<ffffffff811356cd>] generic_file_buffered_write+0x5d/0x90
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498740] 
[<ffffffffa07952d5>] xfs_file_buffered_aio_write+0x115/0x1c0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498744] 
[<ffffffff816159f4>] ? ip_finish_output+0x224/0x3b0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498758] 
[<ffffffffa079547c>] xfs_file_aio_write+0xfc/0x1b0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498769] 
[<ffffffffa0795380>] ? xfs_file_buffered_aio_write+0x1c0/0x1c0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498773] 
[<ffffffff8119b8c3>] do_sync_readv_writev+0xa3/0xe0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498776] 
[<ffffffff8119bb8d>] do_readv_writev+0xcd/0x1d0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498780] 
[<ffffffff810877e0>] ? set_groups+0x40/0x60
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498805] 
[<ffffffffa033a6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498808] 
[<ffffffff8119bccc>] vfs_writev+0x3c/0x50
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498815] 
[<ffffffffa0333dd2>] nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498818] 
[<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498825] 
[<ffffffffa0336679>] nfsd_write+0xf9/0x110 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498834] 
[<ffffffffa0340dd1>] nfsd3_proc_write+0xb1/0x140 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498841] 
[<ffffffffa032fd62>] nfsd_dispatch+0x102/0x270 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498865] 
[<ffffffffa0103b48>] svc_process_common+0x328/0x5e0 [sunrpc]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498876] 
[<ffffffffa0104153>] svc_process+0x103/0x160 [sunrpc]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498882] 
[<ffffffffa032f72f>] nfsd+0xbf/0x130 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498888] 
[<ffffffffa032f670>] ? nfsd_destroy+0x80/0x80 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498892] 
[<ffffffff8107f050>] kthread+0xc0/0xd0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498895] 
[<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498900] 
[<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498903] 
[<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498905] SysRq : Show Memory
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499385] Mem-Info:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499388] Node 0 DMA per-cpu:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499391] CPU    0: hi:    0, 
btch:   1 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499392] CPU    1: hi:    0, 
btch:   1 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499394] CPU    2: hi:    0, 
btch:   1 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499395] CPU    3: hi:    0, 
btch:   1 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499396] Node 0 DMA32 
per-cpu:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499399] CPU    0: hi:  186, 
btch:  31 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499400] CPU    1: hi:  186, 
btch:  31 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499402] CPU    2: hi:  186, 
btch:  31 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499403] CPU    3: hi:  186, 
btch:  31 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499404] Node 0 Normal 
per-cpu:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499406] CPU    0: hi:  186, 
btch:  31 usd:  23
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499407] CPU    1: hi:  186, 
btch:  31 usd:  25
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499409] CPU    2: hi:  186, 
btch:  31 usd:   0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499410] CPU    3: hi:  186, 
btch:  31 usd:   7
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] active_anon:2143 
inactive_anon:44181 isolated_anon:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  active_file:913373 
inactive_file:1464930 isolated_file:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  unevictable:8449 
dirty:6742 writeback:115 unstable:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free:159418 
slab_reclaimable:146857 slab_unreclaimable:66681
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  mapped:5561 
shmem:383 pagetables:2195 bounce:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free_cma:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499418] Node 0 DMA 
free:15884kB min:84kB low:104kB high:124kB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:15628kB managed:15884kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
yes
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499423] lowmem_reserve[]: 0 
3512 12080 12080
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499427] Node 0 DMA32 
free:240040kB min:19624kB low:24528kB high:29436kB active_anon:8kB
inactive_anon:15196kB active_file:1503256kB inactive_file:1624328kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596532kB
managed:3546428kB mlocked:0kB dirty:14652kB writeback:0kB mapped:988kB 
shmem:0kB slab_reclaimable:173044kB slab_unreclaimable:15988kB
kernel_stack:248kB pagetables:28kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499432] lowmem_reserve[]: 0 
0 8568 8568
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499435] Node 0 Normal 
free:381748kB min:47872kB low:59840kB high:71808kB active_anon:8564kB
inactive_anon:161528kB active_file:2150236kB inactive_file:4235392kB 
unevictable:33796kB isolated(anon):0kB isolated(file):0kB present:8773632kB
managed:8715068kB mlocked:33796kB dirty:12316kB writeback:460kB 
mapped:21256kB shmem:1532kB slab_reclaimable:414384kB
slab_unreclaimable:250736kB kernel_stack:8424kB pagetables:8752kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499443] lowmem_reserve[]: 0 
0 0 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499446] Node 0 DMA: 1*4kB 
(U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 
1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15884kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499458] Node 0 DMA32: 
27954*4kB (UEM) 12312*8kB (UEM) 1713*16kB (UEM) 17*32kB (UMR) 2*64kB (R) 
2*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 
240440kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499469] Node 0 Normal: 
31971*4kB (UEM) 31254*8kB (UEM) 2*16kB (EM) 5*32kB (R) 4*64kB (R) 2*128kB 
(R) 0*256kB 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 381692kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499480] 2380174 total 
pagecache pages
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499482] 0 pages in swap 
cache
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499483] Swap cache stats: 
add 0, delete 0, find 0/0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499485] Free swap  = 
522236kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499486] Total swap 
=522236kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3145712 pages RAM
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 69093 pages 
reserved
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3401898 pages 
shared
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 1067069 pages 
non-shared
...
34501    Jun 26 14:20:03 vsa-00000003-vc-1 kernel: [91864.156051] XFS: 
pid=12642 kmem_alloc success after 13443 retries inum=35646862875 size=35584 
flags=0x4 lflags=0x250


* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-27 21:01   ` Alex Lyakas
@ 2015-06-28 18:19     ` Alex Lyakas
  2015-06-29 11:43     ` Brian Foster
  2015-06-29 22:26     ` Dave Chinner
  2 siblings, 0 replies; 17+ messages in thread
From: Alex Lyakas @ 2015-06-28 18:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Danny Shavit, Shyam Kaushik, Yair Hershko, hch, xfs

Hi Dave,
I have some more observations from further debugging.

1) I tracked down the problematic inodes using xfs_db (I still don't know
their pathnames though). One of them is not problematic anymore (it seems
like it got truncated, or perhaps deleted and its inode number got reused). But
the two other inodes are huge files with hundreds of thousands of extents [1][2].

2) I am doing some experiments comparing writing local files vs the same
workload through NFS, using the same 4KB-data/4KB-hole workload on small 2MB
files. When writing the file locally, I see that xfs_file_buffered_aio_write
is always called with a single 4KB buffer:
xfs_file_buffered_aio_write: inum=100663559 nr_segs=1
seg #0: {.iov_base=0x18db8f0, .iov_len=4096}

But when doing the same workload through NFS:
xfs_file_buffered_aio_write: inum=167772423 nr_segs=2
seg #0: {.iov_base=0xffff88006c1100a8, .iov_len=3928}
seg #1: {.iov_base=0xffff88005556e000, .iov_len=168}
There are always two such buffers in the IOV.

Debugging further into nfsd4_decode_write, it appears that some NFS headers
arrive first in the page and the data starts after them, so an additional page
is required to fit the rest of the 4KB of data.

I am still trying to debug why this results in XFS requiring many more
extents to fit such a workload. I tapped into some functions and I am seeing
the following:

Local workload:
6    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x0 ext_diff=0x1, nextents=0 
new_size=16 if_bytes=0 if_real_bytes=0
25    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
44    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
63    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
82    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
101    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
120    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
139    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
158    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
177    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
196    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
215    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
234    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
253    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
272    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
291    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
So we always stay within inline extents.

NFS workload:
7    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x0 ext_diff=0x1, nextents=0 
new_size=16 if_bytes=0 if_real_bytes=0
31    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
61    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
88    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
115    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1, nextents=1 
new_size=32 if_bytes=16 if_real_bytes=0
124    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1, nextents=2 
new_size=48 if_bytes=32 if_real_bytes=0
130    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1, nextents=2 
new_size=48 if_bytes=32 if_real_bytes=0
136    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x2 ext_diff=0x1, nextents=3 
new_size=64 if_bytes=48 if_real_bytes=128
142    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x3 ext_diff=0x1, nextents=4 
new_size=80 if_bytes=64 if_real_bytes=64
154    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x5 ext_diff=0x1, nextents=5 
new_size=96 if_bytes=80 if_real_bytes=128
160    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x5 ext_diff=0x1, nextents=6 
new_size=112 if_bytes=96 if_real_bytes=128
166    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x6 ext_diff=0x1, nextents=6 
new_size=112 if_bytes=96 if_real_bytes=128
178    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x5 ext_diff=0x1, nextents=7 
new_size=128 if_bytes=112 if_real_bytes=128
193    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x6 ext_diff=0x1, nextents=8 
new_size=144 if_bytes=128 if_real_bytes=128
199    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x7 ext_diff=0x1, nextents=9 
new_size=160 if_bytes=144 if_real_bytes=256
205    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x8 ext_diff=0x1, 
nextents=10 new_size=176 if_bytes=160 if_real_bytes=256
217    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xa ext_diff=0x1, 
nextents=11 new_size=192 if_bytes=176 if_real_bytes=256
226    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xb ext_diff=0x1, 
nextents=11 new_size=192 if_bytes=176 if_real_bytes=256
244    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xb ext_diff=0x1, 
nextents=11 new_size=192 if_bytes=176 if_real_bytes=256
262    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xc ext_diff=0x1, 
nextents=12 new_size=208 if_bytes=192 if_real_bytes=256
271    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xb ext_diff=0x1, 
nextents=13 new_size=224 if_bytes=208 if_real_bytes=256
277    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xb ext_diff=0x1, 
nextents=13 new_size=224 if_bytes=208 if_real_bytes=256
295    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xd ext_diff=0x1, 
nextents=14 new_size=240 if_bytes=224 if_real_bytes=256
316    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xe ext_diff=0x1, 
nextents=14 new_size=240 if_bytes=224 if_real_bytes=256
343    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xe ext_diff=0x1, 
nextents=14 new_size=240 if_bytes=224 if_real_bytes=256
370    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xe ext_diff=0x1, 
nextents=14 new_size=240 if_bytes=224 if_real_bytes=256
382    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xe ext_diff=0x1, 
nextents=14 new_size=240 if_bytes=224 if_real_bytes=256
406    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xe ext_diff=0x1, 
nextents=15 new_size=256 if_bytes=240 if_real_bytes=256
412    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xe ext_diff=0x1, 
nextents=15 new_size=256 if_bytes=240 if_real_bytes=256
418    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0xf ext_diff=0x1, 
nextents=16 new_size=272 if_bytes=256 if_real_bytes=256
424    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x10 ext_diff=0x1, 
nextents=17 new_size=288 if_bytes=272 if_real_bytes=512
430    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x11 ext_diff=0x1, 
nextents=18 new_size=304 if_bytes=288 if_real_bytes=512
439    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x13 ext_diff=0x1, 
nextents=19 new_size=320 if_bytes=304 if_real_bytes=512

The number of extents keeps growing, but I still could not see why this is
happening. Can you please give a hint why?

3) I tried to see what is the largest file XFS can maintain with this
4KB-data/4KB-hole workload on a VM with 5GB RAM. I was able to reach 146GB
and almost 9M extents. There were a lot of "memory allocation deadlock"
messages popping up, but eventually the allocation would succeed. Until finally,
an allocation could not succeed for 3 minutes and a hung-task panic occurred.
With a no-hole 32KB workload, XFS successfully created a 400GB file with
92789 extents.

Thanks,
Alex.





[1]
xfs_db> inode 35741520897
xfs_db> p
core.magic = 0x494e
core.mode = 0100600
core.version = 2
core.format = 3 (btree)
core.nlinkv2 = 1
core.onlink = 0
core.projid_lo = 0
core.projid_hi = 0
core.uid = 0
core.gid = 0
core.flushiter = 454
core.atime.sec = Sat Jun 13 00:02:19 2015
core.atime.nsec = 698625000
core.mtime.sec = Tue Jun 16 02:36:42 2015
core.mtime.nsec = 786938999
core.ctime.sec = Sat Jun 27 03:21:14 2015
core.ctime.nsec = 899709747
core.size = 180676487417
core.nblocks = 44114144
core.extsize = 0
core.nextents = 928797
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.filestream = 0
core.gen = 1092231143
next_unlinked = null
u.bmbt.level = 3
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:2541992517

[2]
xfs_db> inode 35646862875
xfs_db> p
core.magic = 0x494e
core.mode = 0100600
core.version = 2
core.format = 3 (btree)
core.nlinkv2 = 1
core.onlink = 0
core.projid_lo = 0
core.projid_hi = 0
core.uid = 0
core.gid = 0
core.flushiter = 1321
core.atime.sec = Fri Jun 26 11:07:27 2015
core.atime.nsec = 642685048
core.mtime.sec = Fri Jun 26 14:20:57 2015
core.mtime.nsec = 840214048
core.ctime.sec = Fri Jun 26 14:21:07 2015
core.ctime.nsec = 616214048
core.size = 149676922744
core.nblocks = 36544482
core.extsize = 0
core.nextents = 572425
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.filestream = 0
core.gen = 3873914382
next_unlinked = null
u.bmbt.level = 2
u.bmbt.numrecs = 9
u.bmbt.keys[1-9] = [startoff] 1:[0] 2:[3764720] 3:[7291520] 4:[11373840] 
5:[16320112] 6:[20521147] 7:[23938933] 8:[28154576] 9:[32607104]
u.bmbt.ptrs[1-9] = 1:9885338478 2:9892028257 3:5911725342 4:10259977998 
5:9907385089 6:10056188649 7:10058512073 8:9917653582 9:9922453713







* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-27 21:01   ` Alex Lyakas
  2015-06-28 18:19     ` Alex Lyakas
@ 2015-06-29 11:43     ` Brian Foster
  2015-06-29 17:59       ` Alex Lyakas
  2015-06-29 22:26     ` Dave Chinner
  2 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2015-06-29 11:43 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Shyam Kaushik, Yair Hershko, xfs, Danny Shavit, hch

On Sat, Jun 27, 2015 at 11:01:30PM +0200, Alex Lyakas wrote:
> Hi Dave,
> 
> First, to answer your questions:
> - this filesystem is accessed only through NFS, so I guess we can say this
> is a "dedicated" NFS server
> - yes, both NFSv3 and NFSv4 are used (by different servers), and definitely
> they may attempt to access the same file
> 
> I did some further debugging by patching XFS to get some more info on how
> much memory is requested, and who requests it. I added a kmem_realloc
> variant that is called only by xfs_iext_realloc_indirect. This new function
> (kmem_realloc_xfs_iext_realloc_indirect) also prints the inode number, which
> needs the memory. This is after some fiddling to check that "ifp" is the one
> embedded in xfs_inode and not the one allocated through kmem_cache (attr
> fork). After memory allocation fails for the first time, I also call
> handle_sysrq('m'). Finally, I print after how many retries the allocation
> succeeded. I also added some info to the normal kmem_alloc function.
> 
> Results are following:
> - memory allocation failures happened only on the
> kmem_realloc_xfs_iext_realloc_indirect path for now
> - XFS hits memory re-allocation failures when it needs to allocate about
> 35KB. Sometimes allocation succeeds after few retries, but sometimes it
> takes several thousands of retries.
> - All allocation failures happened on NFSv3 paths
> - Three inode numbers were reported as failing memory allocations. After
> several hours, "find -inum" is still searching for these inodes...this is a
> huge filesystem... Is there any other quicker (XFS-specific?) way to find
> the file based on inode number?
> 
> Please see a typical allocation failure in [1]
> 
> Any recommendation how to move forward with this issue?
> 
> Additional observation that I saw in my local system: writing files to XFS
> locally vs writing the same files via NFS (both 3 and 4), the amount of
> extents reported by "xfs_bmap" is much higher for the NFS case. For example,
> creating a new file and writing into it as follows:
> - write 4KB
> - skip 4KB (i.e., lseek to 4KB + 4KB)
> - write 4KB
> - skip 4KB
> ...
> Create a file of say 50MB this way.
> 
> Locally it ends up with very few (1-5) extents. But same exact workload
> through NFS results in several thousands of extents. The filesystem is
> mounted as "sync" in both cases. This is currently just an observation,
> which I will try to debug further. Are you perhaps familiar with such
> behavior? Can you perhaps try and check whether it also happens in your
> environment?
> 

I believe this is usually the work of speculative preallocation[0]. It
basically allocates more blocks than are immediately needed for a
file-extending allocation, precisely to avoid this kind of
fragmentation. The xfs_iomap_prealloc_size tracepoint was added to help
observe the behavior, but it looks like that went in 3.10 and you're on
3.8. It was also more recently disabled for files smaller than 64k, but
again that happened in 3.11.
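
On a kernel new enough to have that tracepoint, it can be watched via
ftrace, roughly like this (assuming debugfs is mounted in the usual
place):

echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_iomap_prealloc_size/enable
cat /sys/kernel/debug/tracing/trace_pipe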

Anyways, it seems like perhaps it is working normally locally, but for
some reason not working appropriately over NFS..? It might be a good
idea to post xfs_info and mount options if you haven't.

Brian

[0] http://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F

> Thanks for your help,
> Alex.
> 
> 
> 
> [1]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.497033] XFS: pid=12642
> kmem_alloc failure inum=35646862875 size=35584 flags=0x4 lflags=0x250
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498407] Pid: 12642, comm:
> nfsd Tainted: GF          O 3.8.13-030813-generic #201305111843
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498410] Call Trace:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498450]
> [<ffffffffa07a832e>] kmem_alloc_xfs_iext_realloc_indirect+0x16e/0x1f0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498465]
> [<ffffffffa07a86b3>] kmem_realloc_xfs_iext_realloc_indirect+0x33/0x80 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498482]
> [<ffffffffa07db2a0>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498498]
> [<ffffffffa07db2ff>] xfs_iext_irec_new+0x3f/0x180 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498516]
> [<ffffffffa07db5b9>] xfs_iext_add_indirect_multi+0x179/0x2b0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498532]
> [<ffffffffa07db9be>] xfs_iext_add+0xce/0x290 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498539]
> [<ffffffff81076634>] ? wake_up_worker+0x24/0x30
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498553]
> [<ffffffffa07dbbd2>] xfs_iext_insert+0x52/0x100 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498569]
> [<ffffffffa07b6ed3>] ? xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498583]
> [<ffffffffa07b6ed3>] xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498598]
> [<ffffffffa07da3fc>] ? xfs_iext_bno_to_ext+0x8c/0x160 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498614]
> [<ffffffffa07b7719>] xfs_bmapi_reserve_delalloc+0x279/0x2a0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498628]
> [<ffffffffa07be742>] xfs_bmapi_delay+0x122/0x270 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498642]
> [<ffffffffa079d703>] xfs_iomap_write_delay+0x173/0x320 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498656]
> [<ffffffffa07be42c>] ? xfs_bmapi_read+0xfc/0x2f0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498660]
> [<ffffffff8135d8f3>] ? call_rwsem_down_write_failed+0x13/0x20
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498665]
> [<ffffffff81195bbc>] ? lookup_page_cgroup+0x4c/0x50
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498677]
> [<ffffffffa078ab40>] __xfs_get_blocks+0x280/0x550 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498688]
> [<ffffffffa078ae41>] xfs_get_blocks+0x11/0x20 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498695]
> [<ffffffff811cf77e>] __block_write_begin+0x1ae/0x4e0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498705]
> [<ffffffffa078ae30>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498708]
> [<ffffffff81135fff>] ? grab_cache_page_write_begin+0x8f/0xf0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498720]
> [<ffffffffa078a09f>] xfs_vm_write_begin+0x5f/0xe0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498725]
> [<ffffffff8113552a>] generic_perform_write+0xca/0x210
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498729]
> [<ffffffff811356cd>] generic_file_buffered_write+0x5d/0x90
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498740]
> [<ffffffffa07952d5>] xfs_file_buffered_aio_write+0x115/0x1c0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498744]
> [<ffffffff816159f4>] ? ip_finish_output+0x224/0x3b0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498758]
> [<ffffffffa079547c>] xfs_file_aio_write+0xfc/0x1b0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498769]
> [<ffffffffa0795380>] ? xfs_file_buffered_aio_write+0x1c0/0x1c0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498773]
> [<ffffffff8119b8c3>] do_sync_readv_writev+0xa3/0xe0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498776]
> [<ffffffff8119bb8d>] do_readv_writev+0xcd/0x1d0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498780]
> [<ffffffff810877e0>] ? set_groups+0x40/0x60
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498805]
> [<ffffffffa033a6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498808]
> [<ffffffff8119bccc>] vfs_writev+0x3c/0x50
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498815]
> [<ffffffffa0333dd2>] nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498818]
> [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498825]
> [<ffffffffa0336679>] nfsd_write+0xf9/0x110 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498834]
> [<ffffffffa0340dd1>] nfsd3_proc_write+0xb1/0x140 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498841]
> [<ffffffffa032fd62>] nfsd_dispatch+0x102/0x270 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498865]
> [<ffffffffa0103b48>] svc_process_common+0x328/0x5e0 [sunrpc]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498876]
> [<ffffffffa0104153>] svc_process+0x103/0x160 [sunrpc]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498882]
> [<ffffffffa032f72f>] nfsd+0xbf/0x130 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498888]
> [<ffffffffa032f670>] ? nfsd_destroy+0x80/0x80 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498892]
> [<ffffffff8107f050>] kthread+0xc0/0xd0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498895]
> [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498900]
> [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498903]
> [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498905] SysRq : Show Memory
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499385] Mem-Info:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499388] Node 0 DMA per-cpu:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499391] CPU    0: hi:    0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499392] CPU    1: hi:    0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499394] CPU    2: hi:    0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499395] CPU    3: hi:    0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499396] Node 0 DMA32
> per-cpu:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499399] CPU    0: hi:  186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499400] CPU    1: hi:  186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499402] CPU    2: hi:  186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499403] CPU    3: hi:  186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499404] Node 0 Normal
> per-cpu:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499406] CPU    0: hi:  186,
> btch:  31 usd:  23
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499407] CPU    1: hi:  186,
> btch:  31 usd:  25
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499409] CPU    2: hi:  186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499410] CPU    3: hi:  186,
> btch:  31 usd:   7
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] active_anon:2143
> inactive_anon:44181 isolated_anon:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  active_file:913373
> inactive_file:1464930 isolated_file:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  unevictable:8449
> dirty:6742 writeback:115 unstable:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free:159418
> slab_reclaimable:146857 slab_unreclaimable:66681
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  mapped:5561
> shmem:383 pagetables:2195 bounce:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free_cma:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499418] Node 0 DMA
> free:15884kB min:84kB low:104kB high:124kB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:15628kB managed:15884kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable?
> yes
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499423] lowmem_reserve[]: 0
> 3512 12080 12080
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499427] Node 0 DMA32
> free:240040kB min:19624kB low:24528kB high:29436kB active_anon:8kB
> inactive_anon:15196kB active_file:1503256kB inactive_file:1624328kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596532kB
> managed:3546428kB mlocked:0kB dirty:14652kB writeback:0kB mapped:988kB
> shmem:0kB slab_reclaimable:173044kB slab_unreclaimable:15988kB
> kernel_stack:248kB pagetables:28kB unstable:0kB bounce:0kB free_cma:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499432] lowmem_reserve[]: 0
> 0 8568 8568
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499435] Node 0 Normal
> free:381748kB min:47872kB low:59840kB high:71808kB active_anon:8564kB
> inactive_anon:161528kB active_file:2150236kB inactive_file:4235392kB
> unevictable:33796kB isolated(anon):0kB isolated(file):0kB present:8773632kB
> managed:8715068kB mlocked:33796kB dirty:12316kB writeback:460kB
> mapped:21256kB shmem:1532kB slab_reclaimable:414384kB
> slab_unreclaimable:250736kB kernel_stack:8424kB pagetables:8752kB
> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499443] lowmem_reserve[]: 0
> 0 0 0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499446] Node 0 DMA: 1*4kB
> (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
> 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15884kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499458] Node 0 DMA32:
> 27954*4kB (UEM) 12312*8kB (UEM) 1713*16kB (UEM) 17*32kB (UMR) 2*64kB (R)
> 2*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB =
> 240440kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499469] Node 0 Normal:
> 31971*4kB (UEM) 31254*8kB (UEM) 2*16kB (EM) 5*32kB (R) 4*64kB (R) 2*128kB
> (R) 0*256kB 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 381692kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499480] 2380174 total
> pagecache pages
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499482] 0 pages in swap
> cache
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499483] Swap cache stats:
> add 0, delete 0, find 0/0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499485] Free swap  =
> 522236kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499486] Total swap
> =522236kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3145712 pages RAM
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 69093 pages
> reserved
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3401898 pages
> shared
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 1067069 pages
> non-shared
> ...
> 34501    Jun 26 14:20:03 vsa-00000003-vc-1 kernel: [91864.156051] XFS:
> pid=12642 kmem_alloc success after 13443 retries inum=35646862875 size=35584
> flags=0x4 lflags=0x250
> 
> -----Original Message----- From: Dave Chinner
> Sent: 23 June, 2015 10:18 PM
> To: Alex Lyakas
> Cc: xfs@oss.sgi.com ; hch@lst.de
> Subject: Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation
> deadlock"
> 
> On Tue, Jun 23, 2015 at 10:18:21AM +0200, Alex Lyakas wrote:
> >Greetings,
> >
> >We are hitting an issue with XFS printing messages like “XFS:
> >possible memory allocation deadlock in kmem_alloc
> >(mode:0x250)” and stack trace like in [1]. Eventually,
> >hung-task panic kicks in with stack traces like [2].
> >
> >We are running kernel 3.8.13. I see that in
> >http://oss.sgi.com/archives/xfs/2012-01/msg00341.html a similar
> >issue has been discussed, but no code changes followed comparing
> >to what we have in 3.8.13.
> >
> >Any suggestion on how to move forward with this problem? For
> >example, does this memory has to be really allocated with kmalloc
> >(i.e., physically continuous) or vmalloc can be used?
> 
> We left it alone because it is relatively rare for people to hit it,
> and generally it indicates a severe fragmentation problem when they
> do hit it (i.e. a file with millions of extents in it). Can you
> track down the file that this is occurring against and see how many
> extents it has?
> 
> i.e. you may be much better off by taking measures to avoid excessive
> fragmentation than removing the canary from the mine...
> 
> >[109626.075483] nfsd            D 0000000000000002     0 20042      2
> >0x00000000
> 
> Hmmm - it's also a file written by the NFS server - this is is on an
> a dedicated NFS server?
> 
> >[109626.075483]  [<ffffffffa01be6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
> >[109626.075483]  [<ffffffff8119bccc>] ? vfs_writev+0x3c/0x50
> >[109626.075483]  [<ffffffffa01b7dd2>] ? nfsd_vfs_write.isra.12+0x92/0x350
> >[nfsd]
> >[109626.075483]  [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
> >[109626.075483]  [<ffffffffa01ba679>] ? nfsd_write+0xf9/0x110 [nfsd]
> >[109626.075483]  [<ffffffffa01c4dd1>] ? nfsd3_proc_write+0xb1/0x140 [nfsd]
> 
> Interesting that this is an NFSv3 write...
> 
> >[87303.976119] INFO: task nfsd:5684 blocked for more than 180 seconds.
> >[87303.976976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> >this message.
> >[87303.978012] nfsd            D 0000000000000003     0  5684      2
> >0x00000000
> ....
> >[87303.978174]  [<ffffffffa0269623>] nfsd_write+0xa3/0x110 [nfsd]
> >[87303.978182]  [<ffffffffa027794c>] nfsd4_write+0x1cc/0x250 [nfsd]
> >[87303.978189]  [<ffffffffa027746c>] nfsd4_proc_compound+0x5ac/0x7a0
> >[nfsd]
> 
> And that is a NFsv4 write. You have multiple clients writing to the
> same file using different versions of the NFS protocol?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-29 11:43     ` Brian Foster
@ 2015-06-29 17:59       ` Alex Lyakas
  2015-06-29 19:02         ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2015-06-29 17:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: Shyam Kaushik, Yair Hershko, xfs, Danny Shavit, hch

Hi Brian,
Thanks for your comments.

Here is the information you asked for:

meta-data=/dev/dm-147            isize=256    agcount=67, agsize=268435440 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=17825792000, imaxpct=5
         =                       sunit=16     swidth=160 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Mount options:
/dev/dm-147 /export/nfsvol xfs rw,sync,noatime,wsync,attr2,discard,inode64,allocsize=64k,logbsize=64k,sunit=128,swidth=1280,noquota 0 0

So yes, we are using "allocsize=64k", which influences the speculative
preallocation logic. I did various experiments, and indeed when I remove
"allocsize=64k", fragmentation is much lower. (I also tried other things,
like using a single nfsd thread, mounting without "sync" and patching nfsd
to provide "nicer" IOVs to vfs_write, but none of these helped.) On the
other hand, we started using "allocsize=64k" in the first place to prevent
the aggressive preallocation we saw XFS doing on large QCOW files (VM
images).

Still, when doing local IO to a mounted XFS, even with "allocsize=64k", we
get very few extents. We still don't know why there is this difference
between local IO and NFS. It would be great to get a clue about that
phenomenon.
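
A sketch of the kind of strided-write test described above (paths, sizes
and the use of dd are illustrative, not the exact commands we ran); the
same loop can be pointed at the NFS-mounted path for the comparison:

# ~50MB of data written as 4KB chunks with 4KB holes in between
for i in $(seq 0 2 24999); do
	dd if=/dev/zero of=/mnt/xfs/testfile bs=4k count=1 seek=$i conv=notrunc oflag=sync 2>/dev/null
done
# count the resulting extent records (run on the XFS server for the NFS case)
xfs_bmap -v /mnt/xfs/testfile | tail -n +3 | wc -l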



Thanks,
Alex.




-----Original Message----- 
From: Brian Foster
Sent: 29 June, 2015 1:43 PM
To: Alex Lyakas
Cc: Dave Chinner ; Danny Shavit ; Shyam Kaushik ; Yair Hershko ; hch@lst.de 
; xfs@oss.sgi.com
Subject: Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation 
deadlock"

On Sat, Jun 27, 2015 at 11:01:30PM +0200, Alex Lyakas wrote:
> Hi Dave,
>
> First, to answer your questions:
> - this filesystem is accessed only through NFS, so I guess we can say this
> is a "dedicated" NFS server
> - yes, both NFSv3 and NFSv4 are used (by different servers), and 
> definitely
> they may attempt to access the same file
>
> I did some further debugging by patching XFS to get some more info on how
> much memory is requested, and who requests it. I added a kmem_realloc
> variant that is called only by xfs_iext_realloc_indirect. This new 
> function
> (kmem_realloc_xfs_iext_realloc_indirect) also prints the inode number, 
> which
> needs the memory. This is after some fiddling to check that "ifp" is the 
> one
> embedded in xfs_inode and not the one allocated through kmem_cache (attr
> fork). After memory allocation fails for the first time, I also call
> handle_sysrq('m'). Finally, I print after how many retries the allocation
> succeeded. I also added some info to the normal kmem_alloc function.
>
> Results are following:
> - memory allocation failures happened only on the
> kmem_realloc_xfs_iext_realloc_indirect path for now
> - XFS hits memory re-allocation failures when it needs to allocate about
> 35KB. Sometimes allocation succeeds after few retries, but sometimes it
> takes several thousands of retries.
> - All allocation failures happened on NFSv3 paths
> - Three inode numbers were reported as failing memory allocations. After
> several hours, "find -inum" is still searching for these inodes...this is 
> a
> huge filesystem... Is there any other quicker (XFS-specific?) way to find
> the file based on inode number?
>
> Please see a typical allocation failure in [1]
>
> Any recommendation how to move forward with this issue?
>
> Additional observation that I saw in my local system: writing files to XFS
> locally vs writing the same files via NFS (both 3 and 4), the amount of
> extents reported by "xfs_bmap" is much higher for the NFS case. For 
> example,
> creating a new file and writing into it as follows:
> - write 4KB
> - skip 4KB (i.e., lseek to 4KB + 4KB)
> - write 4KB
> - skip 4KB
> ...
> Create a file of say 50MB this way.
>
> Locally it ends up with very few (1-5) extents. But same exact workload
> through NFS results in several thousands of extents. The filesystem is
> mounted as "sync" in both cases. This is currently just an observation,
> which I will try to debug further. Are you perhaps familiar with such
> behavior? Can you perhaps try and check whether it also happens in your
> environment?
>

I believe this is usually the work of speculative preallocation[0]. It
basically should allocate more blocks than necessary for a file
extending allocation to avoid this kind of fragmentation. The
xfs_iomap_prealloc_size tracepoint was added to help observe behavior,
but it looks like that went in 3.10 and you're on 3.8. It was also more
recently disabled for files smaller than 64k, but again that happened in
3.11.

Anyways, it seems like perhaps it is working normally locally, but for
some reason not working appropriately over NFS..? It might be a good
idea to post xfs_info and mount options if you haven't.

Brian

[0] http://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F

> Thanks for your help,
> Alex.
>
>
>
> [1]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.497033] XFS: pid=12642
> kmem_alloc failure inum=35646862875 size=35584 flags=0x4 lflags=0x250
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498407] Pid: 12642, comm:
> nfsd Tainted: GF          O 3.8.13-030813-generic #201305111843
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498410] Call Trace:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498450]
> [<ffffffffa07a832e>] kmem_alloc_xfs_iext_realloc_indirect+0x16e/0x1f0 
> [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498465]
> [<ffffffffa07a86b3>] kmem_realloc_xfs_iext_realloc_indirect+0x33/0x80 
> [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498482]
> [<ffffffffa07db2a0>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498498]
> [<ffffffffa07db2ff>] xfs_iext_irec_new+0x3f/0x180 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498516]
> [<ffffffffa07db5b9>] xfs_iext_add_indirect_multi+0x179/0x2b0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498532]
> [<ffffffffa07db9be>] xfs_iext_add+0xce/0x290 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498539]
> [<ffffffff81076634>] ? wake_up_worker+0x24/0x30
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498553]
> [<ffffffffa07dbbd2>] xfs_iext_insert+0x52/0x100 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498569]
> [<ffffffffa07b6ed3>] ? xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498583]
> [<ffffffffa07b6ed3>] xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498598]
> [<ffffffffa07da3fc>] ? xfs_iext_bno_to_ext+0x8c/0x160 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498614]
> [<ffffffffa07b7719>] xfs_bmapi_reserve_delalloc+0x279/0x2a0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498628]
> [<ffffffffa07be742>] xfs_bmapi_delay+0x122/0x270 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498642]
> [<ffffffffa079d703>] xfs_iomap_write_delay+0x173/0x320 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498656]
> [<ffffffffa07be42c>] ? xfs_bmapi_read+0xfc/0x2f0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498660]
> [<ffffffff8135d8f3>] ? call_rwsem_down_write_failed+0x13/0x20
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498665]
> [<ffffffff81195bbc>] ? lookup_page_cgroup+0x4c/0x50
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498677]
> [<ffffffffa078ab40>] __xfs_get_blocks+0x280/0x550 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498688]
> [<ffffffffa078ae41>] xfs_get_blocks+0x11/0x20 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498695]
> [<ffffffff811cf77e>] __block_write_begin+0x1ae/0x4e0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498705]
> [<ffffffffa078ae30>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498708]
> [<ffffffff81135fff>] ? grab_cache_page_write_begin+0x8f/0xf0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498720]
> [<ffffffffa078a09f>] xfs_vm_write_begin+0x5f/0xe0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498725]
> [<ffffffff8113552a>] generic_perform_write+0xca/0x210
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498729]
> [<ffffffff811356cd>] generic_file_buffered_write+0x5d/0x90
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498740]
> [<ffffffffa07952d5>] xfs_file_buffered_aio_write+0x115/0x1c0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498744]
> [<ffffffff816159f4>] ? ip_finish_output+0x224/0x3b0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498758]
> [<ffffffffa079547c>] xfs_file_aio_write+0xfc/0x1b0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498769]
> [<ffffffffa0795380>] ? xfs_file_buffered_aio_write+0x1c0/0x1c0 [xfs]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498773]
> [<ffffffff8119b8c3>] do_sync_readv_writev+0xa3/0xe0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498776]
> [<ffffffff8119bb8d>] do_readv_writev+0xcd/0x1d0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498780]
> [<ffffffff810877e0>] ? set_groups+0x40/0x60
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498805]
> [<ffffffffa033a6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498808]
> [<ffffffff8119bccc>] vfs_writev+0x3c/0x50
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498815]
> [<ffffffffa0333dd2>] nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498818]
> [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498825]
> [<ffffffffa0336679>] nfsd_write+0xf9/0x110 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498834]
> [<ffffffffa0340dd1>] nfsd3_proc_write+0xb1/0x140 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498841]
> [<ffffffffa032fd62>] nfsd_dispatch+0x102/0x270 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498865]
> [<ffffffffa0103b48>] svc_process_common+0x328/0x5e0 [sunrpc]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498876]
> [<ffffffffa0104153>] svc_process+0x103/0x160 [sunrpc]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498882]
> [<ffffffffa032f72f>] nfsd+0xbf/0x130 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498888]
> [<ffffffffa032f670>] ? nfsd_destroy+0x80/0x80 [nfsd]
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498892]
> [<ffffffff8107f050>] kthread+0xc0/0xd0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498895]
> [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498900]
> [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498903]
> [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498905] SysRq : Show 
> Memory
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499385] Mem-Info:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499388] Node 0 DMA 
> per-cpu:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499391] CPU    0: hi: 
> 0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499392] CPU    1: hi: 
> 0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499394] CPU    2: hi: 
> 0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499395] CPU    3: hi: 
> 0,
> btch:   1 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499396] Node 0 DMA32
> per-cpu:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499399] CPU    0: hi: 
> 186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499400] CPU    1: hi: 
> 186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499402] CPU    2: hi: 
> 186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499403] CPU    3: hi: 
> 186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499404] Node 0 Normal
> per-cpu:
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499406] CPU    0: hi: 
> 186,
> btch:  31 usd:  23
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499407] CPU    1: hi: 
> 186,
> btch:  31 usd:  25
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499409] CPU    2: hi: 
> 186,
> btch:  31 usd:   0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499410] CPU    3: hi: 
> 186,
> btch:  31 usd:   7
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] active_anon:2143
> inactive_anon:44181 isolated_anon:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] 
> active_file:913373
> inactive_file:1464930 isolated_file:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  unevictable:8449
> dirty:6742 writeback:115 unstable:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free:159418
> slab_reclaimable:146857 slab_unreclaimable:66681
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  mapped:5561
> shmem:383 pagetables:2195 bounce:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free_cma:0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499418] Node 0 DMA
> free:15884kB min:84kB low:104kB high:124kB active_anon:0kB 
> inactive_anon:0kB
> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:15628kB managed:15884kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 
> all_unreclaimable?
> yes
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499423] lowmem_reserve[]: 
> 0
> 3512 12080 12080
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499427] Node 0 DMA32
> free:240040kB min:19624kB low:24528kB high:29436kB active_anon:8kB
> inactive_anon:15196kB active_file:1503256kB inactive_file:1624328kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596532kB
> managed:3546428kB mlocked:0kB dirty:14652kB writeback:0kB mapped:988kB
> shmem:0kB slab_reclaimable:173044kB slab_unreclaimable:15988kB
> kernel_stack:248kB pagetables:28kB unstable:0kB bounce:0kB free_cma:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499432] lowmem_reserve[]: 
> 0
> 0 8568 8568
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499435] Node 0 Normal
> free:381748kB min:47872kB low:59840kB high:71808kB active_anon:8564kB
> inactive_anon:161528kB active_file:2150236kB inactive_file:4235392kB
> unevictable:33796kB isolated(anon):0kB isolated(file):0kB 
> present:8773632kB
> managed:8715068kB mlocked:33796kB dirty:12316kB writeback:460kB
> mapped:21256kB shmem:1532kB slab_reclaimable:414384kB
> slab_unreclaimable:250736kB kernel_stack:8424kB pagetables:8752kB
> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499443] lowmem_reserve[]: 
> 0
> 0 0 0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499446] Node 0 DMA: 1*4kB
> (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
> 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15884kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499458] Node 0 DMA32:
> 27954*4kB (UEM) 12312*8kB (UEM) 1713*16kB (UEM) 17*32kB (UMR) 2*64kB (R)
> 2*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB =
> 240440kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499469] Node 0 Normal:
> 31971*4kB (UEM) 31254*8kB (UEM) 2*16kB (EM) 5*32kB (R) 4*64kB (R) 2*128kB
> (R) 0*256kB 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 381692kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499480] 2380174 total
> pagecache pages
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499482] 0 pages in swap
> cache
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499483] Swap cache stats:
> add 0, delete 0, find 0/0
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499485] Free swap  =
> 522236kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499486] Total swap
> =522236kB
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3145712 pages RAM
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 69093 pages
> reserved
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3401898 pages
> shared
> Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 1067069 pages
> non-shared
> ...
> 34501    Jun 26 14:20:03 vsa-00000003-vc-1 kernel: [91864.156051] XFS:
> pid=12642 kmem_alloc success after 13443 retries inum=35646862875 
> size=35584
> flags=0x4 lflags=0x250
>
> -----Original Message----- From: Dave Chinner
> Sent: 23 June, 2015 10:18 PM
> To: Alex Lyakas
> Cc: xfs@oss.sgi.com ; hch@lst.de
> Subject: Re: xfs_iext_realloc_indirect and "XFS: possible memory 
> allocation
> deadlock"
>
> On Tue, Jun 23, 2015 at 10:18:21AM +0200, Alex Lyakas wrote:
> >Greetings,
> >
> >We are hitting an issue with XFS printing messages like “XFS:
> >possible memory allocation deadlock in kmem_alloc
> >(mode:0x250)” and stack trace like in [1]. Eventually,
> >hung-task panic kicks in with stack traces like [2].
> >
> >We are running kernel 3.8.13. I see that in
> >http://oss.sgi.com/archives/xfs/2012-01/msg00341.html a similar
> >issue has been discussed, but no code changes followed comparing
> >to what we have in 3.8.13.
> >
> >Any suggestion on how to move forward with this problem? For
> >example, does this memory has to be really allocated with kmalloc
> >(i.e., physically continuous) or vmalloc can be used?
>
> We left it alone because it is relatively rare for people to hit it,
> and generally it indicates a severe fragmentation problem when they
> do hit it (i.e. a file with millions of extents in it). Can you
> track down the file that this is occurring against and see how many
> extents it has?
>
> i.e. you may be much better off by taking measures to avoid excessive
> fragmentation than removing the canary from the mine...
>
> >[109626.075483] nfsd            D 0000000000000002     0 20042      2
> >0x00000000
>
> Hmmm - it's also a file written by the NFS server - this is is on an
> a dedicated NFS server?
>
> >[109626.075483]  [<ffffffffa01be6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
> >[109626.075483]  [<ffffffff8119bccc>] ? vfs_writev+0x3c/0x50
> >[109626.075483]  [<ffffffffa01b7dd2>] ? nfsd_vfs_write.isra.12+0x92/0x350
> >[nfsd]
> >[109626.075483]  [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
> >[109626.075483]  [<ffffffffa01ba679>] ? nfsd_write+0xf9/0x110 [nfsd]
> >[109626.075483]  [<ffffffffa01c4dd1>] ? nfsd3_proc_write+0xb1/0x140 
> >[nfsd]
>
> Interesting that this is an NFSv3 write...
>
> >[87303.976119] INFO: task nfsd:5684 blocked for more than 180 seconds.
> >[87303.976976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> >disables
> >this message.
> >[87303.978012] nfsd            D 0000000000000003     0  5684      2
> >0x00000000
> ....
> >[87303.978174]  [<ffffffffa0269623>] nfsd_write+0xa3/0x110 [nfsd]
> >[87303.978182]  [<ffffffffa027794c>] nfsd4_write+0x1cc/0x250 [nfsd]
> >[87303.978189]  [<ffffffffa027746c>] nfsd4_proc_compound+0x5ac/0x7a0
> >[nfsd]
>
> And that is a NFsv4 write. You have multiple clients writing to the
> same file using different versions of the NFS protocol?
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-29 17:59       ` Alex Lyakas
@ 2015-06-29 19:02         ` Brian Foster
  0 siblings, 0 replies; 17+ messages in thread
From: Brian Foster @ 2015-06-29 19:02 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Danny Shavit, Shyam Kaushik, Yair Hershko, hch, xfs

On Mon, Jun 29, 2015 at 07:59:00PM +0200, Alex Lyakas wrote:
> Hi Brian,
> Thanks for your comments.
> 
> Here is the information you asked for:
> 
> meta-data=/dev/dm-147            isize=256    agcount=67, agsize=268435440
> blks
>         =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=17825792000, imaxpct=5
>         =                       sunit=16     swidth=160 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>         =                       sectsz=512   sunit=16 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Mount options:
> /dev/dm-147 /export/nfsvol xfs rw,sync,noatime,wsync,attr2,discard,inode64,allocsize=64k,logbsize=64k,sunit=128,swidth=1280,noquota
> 0 0
> 
> So yes, we are using "allocsize=64k", which influences the speculative
> allocation logic. I did various experiments, and indeed when I remove this
> "allocsize=64k", fragmentation is much lesser. (Tried also other things,
> like using a single nfsd thread, mounting without "sync" and patching nfsd
> to provide "nicer" IOV to vfs_write, but none of these helped). On the other
> hand, we started using this option "allocsize=64k" to prevent aggressive
> preallocation that we saw XFS doing on large QCOW files (VM images).
> 

What was the problem with regard to preallocation and large VM images?
The preallocation is not permanent and should be cleaned up if the file
is inactive for a period of time (see the other prealloc FAQ entries).
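
If your kernel exposes it (I haven't checked whether 3.8 does), the knob
that controls how long speculative preallocation is kept around before
the background scanner trims it is the fs.xfs.speculative_prealloc_lifetime
sysctl, e.g.:

cat /proc/sys/fs/xfs/speculative_prealloc_lifetime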

> Still, when doing local IO to a mounted XFS, even with "allocsize=64k", we
> still get very few extents. Still don't know why is this difference between
> local IO and NFS. Would be great to receive a clue for that phenomena.
> 

What exactly is your test in this case? I assume you're also testing
with the same mount options and whatnot. One difference could be that
NFS might involve more open-write-close cycles than a local write test,
which could impact reclaim of preallocation. For example, what happens
if you run something like the following locally?

# write 4k at every other 4k offset; each xfs_io invocation opens and
# closes the file, so the close-time prealloc trim gets a chance to run
for i in $(seq 0 2 100); do
	xfs_io -fc "pwrite $((i * 4096)) 4k" /mnt/file
done

This will do the strided writes while opening and closing the file each
time and thus probably more closely matches what might be happening over
NFS. Prealloc is typically trimmed on close, but there is an NFS
specific heuristic that should detect this and let it hang around for
longer in this case. Taking a quick look at that code shows that it is
tied to the existence of delayed allocation blocks at close time,
however. I suppose that might never trigger due to the sync mount
option. What's the reason for using that one?
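
For comparison, here's an untested sketch (bash syntax; /mnt/file2 is
just an example name) of the same strided writes issued through a single
open/close of the file, to see whether the per-close trimming is really
what makes the difference:

# build one xfs_io command line with all the pwrites, so the file is
# opened and closed only once
args=()
for i in $(seq 0 2 100); do args+=(-c "pwrite $((i * 4096)) 4k"); done
xfs_io -f "${args[@]}" /mnt/file2

# rough relative comparison of the resulting extent counts
xfs_bmap /mnt/file | wc -l
xfs_bmap /mnt/file2 | wc -l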

Brian

> 
> 
> Thanks,
> Alex.
> 
> 
> 
> 
> -----Original Message----- From: Brian Foster
> Sent: 29 June, 2015 1:43 PM
> To: Alex Lyakas
> Cc: Dave Chinner ; Danny Shavit ; Shyam Kaushik ; Yair Hershko ; hch@lst.de
> ; xfs@oss.sgi.com
> Subject: Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation
> deadlock"
> 
> On Sat, Jun 27, 2015 at 11:01:30PM +0200, Alex Lyakas wrote:
> >Hi Dave,
> >
> >First, to answer your questions:
> >- this filesystem is accessed only through NFS, so I guess we can say this
> >is a "dedicated" NFS server
> >- yes, both NFSv3 and NFSv4 are used (by different servers), and
> >definitely
> >they may attempt to access the same file
> >
> >I did some further debugging by patching XFS to get some more info on how
> >much memory is requested, and who requests it. I added a kmem_realloc
> >variant that is called only by xfs_iext_realloc_indirect. This new
> >function
> >(kmem_realloc_xfs_iext_realloc_indirect) also prints the inode number,
> >which
> >needs the memory. This is after some fiddling to check that "ifp" is the
> >one
> >embedded in xfs_inode and not the one allocated through kmem_cache (attr
> >fork). After memory allocation fails for the first time, I also call
> >handle_sysrq('m'). Finally, I print after how many retries the allocation
> >succeeded. I also added some info to the normal kmem_alloc function.
> >
> >Results are following:
> >- memory allocation failures happened only on the
> >kmem_realloc_xfs_iext_realloc_indirect path for now
> >- XFS hits memory re-allocation failures when it needs to allocate about
> >35KB. Sometimes allocation succeeds after few retries, but sometimes it
> >takes several thousands of retries.
> >- All allocation failures happened on NFSv3 paths
> >- Three inode numbers were reported as failing memory allocations. After
> >several hours, "find -inum" is still searching for these inodes...this is
> >a
> >huge filesystem... Is there any other quicker (XFS-specific?) way to find
> >the file based on inode number?
> >
> >Please see a typical allocation failure in [1]
> >
> >Any recommendation how to move forward with this issue?
> >
> >Additional observation that I saw in my local system: writing files to XFS
> >locally vs writing the same files via NFS (both 3 and 4), the amount of
> >extents reported by "xfs_bmap" is much higher for the NFS case. For
> >example,
> >creating a new file and writing into it as follows:
> >- write 4KB
> >- skip 4KB (i.e., lseek to 4KB + 4KB)
> >- write 4KB
> >- skip 4KB
> >...
> >Create a file of say 50MB this way.
> >
> >Locally it ends up with very few (1-5) extents. But same exact workload
> >through NFS results in several thousands of extents. The filesystem is
> >mounted as "sync" in both cases. This is currently just an observation,
> >which I will try to debug further. Are you perhaps familiar with such
> >behavior? Can you perhaps try and check whether it also happens in your
> >environment?
> >
> 
> I believe this is usually the work of speculative preallocation[0]. It
> basically should allocate more blocks than necessary for a file
> extending allocation to avoid this kind of fragmentation. The
> xfs_iomap_prealloc_size tracepoint was added to help observe behavior,
> but it looks like that went in 3.10 and you're on 3.8. It was also more
> recently disabled for files smaller than 64k, but again that happened in
> 3.11.
> 
> Anyways, it seems like perhaps it is working normally locally, but for
> some reason not working appropriately over NFS..? It might be a good
> idea to post xfs_info and mount options if you haven't.
> 
> Brian
> 
> [0] http://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F
> 
> >Thanks for your help,
> >Alex.
> >
> >
> >
> >[1]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.497033] XFS: pid=12642
> >kmem_alloc failure inum=35646862875 size=35584 flags=0x4 lflags=0x250
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498407] Pid: 12642, comm:
> >nfsd Tainted: GF          O 3.8.13-030813-generic #201305111843
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498410] Call Trace:
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498450]
> >[<ffffffffa07a832e>] kmem_alloc_xfs_iext_realloc_indirect+0x16e/0x1f0
> >[xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498465]
> >[<ffffffffa07a86b3>] kmem_realloc_xfs_iext_realloc_indirect+0x33/0x80
> >[xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498482]
> >[<ffffffffa07db2a0>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498498]
> >[<ffffffffa07db2ff>] xfs_iext_irec_new+0x3f/0x180 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498516]
> >[<ffffffffa07db5b9>] xfs_iext_add_indirect_multi+0x179/0x2b0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498532]
> >[<ffffffffa07db9be>] xfs_iext_add+0xce/0x290 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498539]
> >[<ffffffff81076634>] ? wake_up_worker+0x24/0x30
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498553]
> >[<ffffffffa07dbbd2>] xfs_iext_insert+0x52/0x100 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498569]
> >[<ffffffffa07b6ed3>] ? xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498583]
> >[<ffffffffa07b6ed3>] xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498598]
> >[<ffffffffa07da3fc>] ? xfs_iext_bno_to_ext+0x8c/0x160 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498614]
> >[<ffffffffa07b7719>] xfs_bmapi_reserve_delalloc+0x279/0x2a0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498628]
> >[<ffffffffa07be742>] xfs_bmapi_delay+0x122/0x270 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498642]
> >[<ffffffffa079d703>] xfs_iomap_write_delay+0x173/0x320 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498656]
> >[<ffffffffa07be42c>] ? xfs_bmapi_read+0xfc/0x2f0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498660]
> >[<ffffffff8135d8f3>] ? call_rwsem_down_write_failed+0x13/0x20
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498665]
> >[<ffffffff81195bbc>] ? lookup_page_cgroup+0x4c/0x50
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498677]
> >[<ffffffffa078ab40>] __xfs_get_blocks+0x280/0x550 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498688]
> >[<ffffffffa078ae41>] xfs_get_blocks+0x11/0x20 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498695]
> >[<ffffffff811cf77e>] __block_write_begin+0x1ae/0x4e0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498705]
> >[<ffffffffa078ae30>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498708]
> >[<ffffffff81135fff>] ? grab_cache_page_write_begin+0x8f/0xf0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498720]
> >[<ffffffffa078a09f>] xfs_vm_write_begin+0x5f/0xe0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498725]
> >[<ffffffff8113552a>] generic_perform_write+0xca/0x210
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498729]
> >[<ffffffff811356cd>] generic_file_buffered_write+0x5d/0x90
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498740]
> >[<ffffffffa07952d5>] xfs_file_buffered_aio_write+0x115/0x1c0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498744]
> >[<ffffffff816159f4>] ? ip_finish_output+0x224/0x3b0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498758]
> >[<ffffffffa079547c>] xfs_file_aio_write+0xfc/0x1b0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498769]
> >[<ffffffffa0795380>] ? xfs_file_buffered_aio_write+0x1c0/0x1c0 [xfs]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498773]
> >[<ffffffff8119b8c3>] do_sync_readv_writev+0xa3/0xe0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498776]
> >[<ffffffff8119bb8d>] do_readv_writev+0xcd/0x1d0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498780]
> >[<ffffffff810877e0>] ? set_groups+0x40/0x60
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498805]
> >[<ffffffffa033a6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498808]
> >[<ffffffff8119bccc>] vfs_writev+0x3c/0x50
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498815]
> >[<ffffffffa0333dd2>] nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498818]
> >[<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498825]
> >[<ffffffffa0336679>] nfsd_write+0xf9/0x110 [nfsd]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498834]
> >[<ffffffffa0340dd1>] nfsd3_proc_write+0xb1/0x140 [nfsd]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498841]
> >[<ffffffffa032fd62>] nfsd_dispatch+0x102/0x270 [nfsd]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498865]
> >[<ffffffffa0103b48>] svc_process_common+0x328/0x5e0 [sunrpc]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498876]
> >[<ffffffffa0104153>] svc_process+0x103/0x160 [sunrpc]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498882]
> >[<ffffffffa032f72f>] nfsd+0xbf/0x130 [nfsd]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498888]
> >[<ffffffffa032f670>] ? nfsd_destroy+0x80/0x80 [nfsd]
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498892]
> >[<ffffffff8107f050>] kthread+0xc0/0xd0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498895]
> >[<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498900]
> >[<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498903]
> >[<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498905] SysRq : Show
> >Memory
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499385] Mem-Info:
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499388] Node 0 DMA
> >per-cpu:
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499391] CPU    0: hi: 0,
> >btch:   1 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499392] CPU    1: hi: 0,
> >btch:   1 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499394] CPU    2: hi: 0,
> >btch:   1 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499395] CPU    3: hi: 0,
> >btch:   1 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499396] Node 0 DMA32
> >per-cpu:
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499399] CPU    0: hi:
> >186,
> >btch:  31 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499400] CPU    1: hi:
> >186,
> >btch:  31 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499402] CPU    2: hi:
> >186,
> >btch:  31 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499403] CPU    3: hi:
> >186,
> >btch:  31 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499404] Node 0 Normal
> >per-cpu:
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499406] CPU    0: hi:
> >186,
> >btch:  31 usd:  23
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499407] CPU    1: hi:
> >186,
> >btch:  31 usd:  25
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499409] CPU    2: hi:
> >186,
> >btch:  31 usd:   0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499410] CPU    3: hi:
> >186,
> >btch:  31 usd:   7
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] active_anon:2143
> >inactive_anon:44181 isolated_anon:0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]
> >active_file:913373
> >inactive_file:1464930 isolated_file:0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  unevictable:8449
> >dirty:6742 writeback:115 unstable:0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free:159418
> >slab_reclaimable:146857 slab_unreclaimable:66681
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  mapped:5561
> >shmem:383 pagetables:2195 bounce:0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415]  free_cma:0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499418] Node 0 DMA
> >free:15884kB min:84kB low:104kB high:124kB active_anon:0kB
> >inactive_anon:0kB
> >active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
> >isolated(file):0kB present:15628kB managed:15884kB mlocked:0kB dirty:0kB
> >writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> >slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> >bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> >all_unreclaimable?
> >yes
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499423] lowmem_reserve[]:
> >0
> >3512 12080 12080
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499427] Node 0 DMA32
> >free:240040kB min:19624kB low:24528kB high:29436kB active_anon:8kB
> >inactive_anon:15196kB active_file:1503256kB inactive_file:1624328kB
> >unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596532kB
> >managed:3546428kB mlocked:0kB dirty:14652kB writeback:0kB mapped:988kB
> >shmem:0kB slab_reclaimable:173044kB slab_unreclaimable:15988kB
> >kernel_stack:248kB pagetables:28kB unstable:0kB bounce:0kB free_cma:0kB
> >writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499432] lowmem_reserve[]:
> >0
> >0 8568 8568
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499435] Node 0 Normal
> >free:381748kB min:47872kB low:59840kB high:71808kB active_anon:8564kB
> >inactive_anon:161528kB active_file:2150236kB inactive_file:4235392kB
> >unevictable:33796kB isolated(anon):0kB isolated(file):0kB
> >present:8773632kB
> >managed:8715068kB mlocked:33796kB dirty:12316kB writeback:460kB
> >mapped:21256kB shmem:1532kB slab_reclaimable:414384kB
> >slab_unreclaimable:250736kB kernel_stack:8424kB pagetables:8752kB
> >unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> >all_unreclaimable? no
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499443] lowmem_reserve[]:
> >0
> >0 0 0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499446] Node 0 DMA: 1*4kB
> >(U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
> >1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15884kB
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499458] Node 0 DMA32:
> >27954*4kB (UEM) 12312*8kB (UEM) 1713*16kB (UEM) 17*32kB (UMR) 2*64kB (R)
> >2*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB =
> >240440kB
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499469] Node 0 Normal:
> >31971*4kB (UEM) 31254*8kB (UEM) 2*16kB (EM) 5*32kB (R) 4*64kB (R) 2*128kB
> >(R) 0*256kB 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 381692kB
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499480] 2380174 total
> >pagecache pages
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499482] 0 pages in swap
> >cache
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499483] Swap cache stats:
> >add 0, delete 0, find 0/0
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499485] Free swap  =
> >522236kB
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499486] Total swap
> >=522236kB
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3145712 pages RAM
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 69093 pages
> >reserved
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3401898 pages
> >shared
> >Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 1067069 pages
> >non-shared
> >...
> >34501    Jun 26 14:20:03 vsa-00000003-vc-1 kernel: [91864.156051] XFS:
> >pid=12642 kmem_alloc success after 13443 retries inum=35646862875
> >size=35584
> >flags=0x4 lflags=0x250
> >
> >-----Original Message----- From: Dave Chinner
> >Sent: 23 June, 2015 10:18 PM
> >To: Alex Lyakas
> >Cc: xfs@oss.sgi.com ; hch@lst.de
> >Subject: Re: xfs_iext_realloc_indirect and "XFS: possible memory
> >allocation
> >deadlock"
> >
> >On Tue, Jun 23, 2015 at 10:18:21AM +0200, Alex Lyakas wrote:
> >>Greetings,
> >>
> >>We are hitting an issue with XFS printing messages like “XFS:
> >>possible memory allocation deadlock in kmem_alloc
> >>(mode:0x250)” and stack trace like in [1]. Eventually,
> >>hung-task panic kicks in with stack traces like [2].
> >>
> >>We are running kernel 3.8.13. I see that in
> >>http://oss.sgi.com/archives/xfs/2012-01/msg00341.html a similar
> >>issue has been discussed, but no code changes followed comparing
> >>to what we have in 3.8.13.
> >>
> >>Any suggestion on how to move forward with this problem? For
> >>example, does this memory has to be really allocated with kmalloc
> >>(i.e., physically continuous) or vmalloc can be used?
> >
> >We left it alone because it is relatively rare for people to hit it,
> >and generally it indicates a severe fragmentation problem when they
> >do hit it (i.e. a file with millions of extents in it). Can you
> >track down the file that this is occurring against and see how many
> >extents it has?
> >
> >i.e. you may be much better off by taking measures to avoid excessive
> >fragmentation than removing the canary from the mine...
> >
> >>[109626.075483] nfsd            D 0000000000000002     0 20042      2
> >>0x00000000
> >
> >Hmmm - it's also a file written by the NFS server - is this on a
> >dedicated NFS server?
> >
> >>[109626.075483]  [<ffffffffa01be6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
> >>[109626.075483]  [<ffffffff8119bccc>] ? vfs_writev+0x3c/0x50
> >>[109626.075483]  [<ffffffffa01b7dd2>] ? nfsd_vfs_write.isra.12+0x92/0x350
> >>[nfsd]
> >>[109626.075483]  [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
> >>[109626.075483]  [<ffffffffa01ba679>] ? nfsd_write+0xf9/0x110 [nfsd]
> >>[109626.075483]  [<ffffffffa01c4dd1>] ? nfsd3_proc_write+0xb1/0x140
> >>[nfsd]
> >
> >Interesting that this is an NFSv3 write...
> >
> >>[87303.976119] INFO: task nfsd:5684 blocked for more than 180 seconds.
> >>[87303.976976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>disables
> >>this message.
> >>[87303.978012] nfsd            D 0000000000000003     0  5684      2
> >>0x00000000
> >....
> >>[87303.978174]  [<ffffffffa0269623>] nfsd_write+0xa3/0x110 [nfsd]
> >>[87303.978182]  [<ffffffffa027794c>] nfsd4_write+0x1cc/0x250 [nfsd]
> >>[87303.978189]  [<ffffffffa027746c>] nfsd4_proc_compound+0x5ac/0x7a0
> >>[nfsd]
> >
> >And that is an NFSv4 write. You have multiple clients writing to the
> >same file using different versions of the NFS protocol?
> >
> >Cheers,
> >
> >Dave.
> >-- 
> >Dave Chinner
> >david@fromorbit.com
> >

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-27 21:01   ` Alex Lyakas
  2015-06-28 18:19     ` Alex Lyakas
  2015-06-29 11:43     ` Brian Foster
@ 2015-06-29 22:26     ` Dave Chinner
  2015-07-06 18:47       ` Alex Lyakas
  2 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-06-29 22:26 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Danny Shavit, Shyam Kaushik, Yair Hershko, hch, xfs

[Compendium reply, top-posting removed, trimmed and re-ordered]

On Sat, Jun 27, 2015 at 11:01:30PM +0200, Alex Lyakas wrote:
> Results are following:
> - memory allocation failures happened only on the
> kmem_realloc_xfs_iext_realloc_indirect path for now
> - XFS hits memory re-allocation failures when it needs to allocate
> about 35KB. Sometimes allocation succeeds after a few retries, but
> sometimes it takes several thousand retries.

Allocations of 35kB are failing? Sounds like you have a serious
memory fragmentation problem if allocations that small are having
trouble.

> - All allocation failures happened on NFSv3 paths
> - Three inode numbers were reported as failing memory allocations.
> After several hours, "find -inum" is still searching for these
> inodes...this is a huge filesystem... Is there any other quicker
> (XFS-specific?) way to find the file based on inode number?

Not yet. You can use the bulkstat ioctl to find the inode by inode
number, then open-by-handle to get a fd for the inode to allow you
to read/write/stat/bmap/etc, but the only way to find the path right
now is to brute force it. That reverse mapping and parent pointer
stuff I'm working on at the moment will make lookups like this easy.
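
For reference, here is a rough userspace sketch of that bulkstat lookup. It
is untested and assumes the xfs_fs.h structures installed by xfsprogs; note
that the returned xfs_bstat also carries bs_extents, i.e. the extent count
you were after:

/*
 * Sketch only: stat a single inode by number with XFS_IOC_FSBULKSTAT_SINGLE.
 * Assumes the uapi structures shipped with xfsprogs; error handling is
 * minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <xfs/xfs.h>

int main(int argc, char **argv)
{
	struct xfs_fsop_bulkreq	req;
	struct xfs_bstat	bs;
	__u64			ino;
	__s32			count = 0;
	int			fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <mountpoint> <inode>\n", argv[0]);
		return 1;
	}
	ino = strtoull(argv[2], NULL, 0);

	/* any fd on the filesystem will do, e.g. the mount point */
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	req.lastip = &ino;	/* inode number to stat */
	req.icount = 1;
	req.ubuffer = &bs;
	req.ocount = &count;

	if (ioctl(fd, XFS_IOC_FSBULKSTAT_SINGLE, &req) < 0) {
		perror("XFS_IOC_FSBULKSTAT_SINGLE");
		return 1;
	}

	/* bs_extents reports the number of extents in the data fork */
	printf("ino=%llu size=%lld extents=%d\n",
		(unsigned long long)bs.bs_ino,
		(long long)bs.bs_size, bs.bs_extents);
	close(fd);
	return 0;
}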

> Any recommendation how to move forward with this issue?
> 
> Additional observation that I saw in my local system: writing files
> to XFS locally vs writing the same files via NFS (both 3 and 4), the
> amount of extents reported by "xfs_bmap" is much higher for the NFS
> case. For example, creating a new file and writing into it as
> follows:
> - write 4KB
> - skip 4KB (i.e., lseek to 4KB + 4KB)
> - write 4KB
> - skip 4KB
> ...
> Create a file of say 50MB this way.
> 
> Locally it ends up with very few (1-5) extents. But same exact
> workload through NFS results in several thousands of extents.

NFS is likely resulting in out of order writes....

> The
> filesystem is mounted as "sync" in both cases.

I'm afraid to ask why, but that is likely your problem - synchronous
out of order writes from the NFS client will fragment the file
badly because they defeat both delayed allocation and speculative
preallocation: there is nothing to trigger the "don't remove
speculative prealloc on file close" heuristic used to avoid
fragmentation caused by out of order NFS writes....

On Sun, Jun 28, 2015 at 08:19:35PM +0200, Alex Lyakas wrote:
> through NFS. Trying the same 4KB-data/4KB-hole workload on small
> files of 2MB. When writing the file locally, I see that
> xfs_file_buffered_aio_write is always called with a single 4KB
> buffer:
> xfs_file_buffered_aio_write: inum=100663559 nr_segs=1
> seg #0: {.iov_base=0x18db8f0, .iov_len=4096}
> 
> But when doing the same workload through NFS:
> xfs_file_buffered_aio_write: inum=167772423 nr_segs=2
> seg #0: {.iov_base=0xffff88006c1100a8, .iov_len=3928}
> seg #1: {.iov_base=0xffff88005556e000, .iov_len=168}
> There are always two such buffers in the IOV.

IOV format is irrelevant to the buffered write behaviour of XFS.

> I am still trying to debug why this results in XFS requiring much
> more extents to fit such workload. I tapped into some functions and
> seeing:
> 
> Local workload:
> 6    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x0 ext_diff=0x1,
> nextents=0 new_size=16 if_bytes=0 if_real_bytes=0
> 25    xfs_iext_add: ifp=0xffff8800096de6b8 idx=0x1 ext_diff=0x1,
.....

Sequential allocation, all nice and contiguous.
Preallocation is clearly not being removed between writes.

> NFS workload:
....
> nextents=1 new_size=32 if_bytes=16 if_real_bytes=0
> 124    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1,
> nextents=2 new_size=48 if_bytes=32 if_real_bytes=0
> 130    xfs_iext_add: ifp=0xffff8800096df4b8 idx=0x1 ext_diff=0x1,

You're not getting sequential allocation, which further points to
problems with preallocation being removed on close.

> Number of extents is growing. But still I could not see why this is
> happening. Can you please give a hint why?

The sync mount option.

> 3) I tried to see what is the largest file XFS can maintain with
> this 4KB-data/4KB-hole workload on a VM with 5GB RAM. I was able to
> reach 146GB and almost 9M extents. There were a lot of "memory
> allocation deadlock" messages popping, but eventually allocation
> would succeed. Until finally, allocation could not succeed for 3
> minutes and hung-task panic occurred.

Well, yes. Each extent requires 32 bytes, plus an index page every
256 leaf pages (i.e. every 256*128=32k extents). So that extent list
requires roughly 300MB of memory, and a contiguous 270 page
allocation. vmalloc is not the answer here - it just papers over the
underlying problem: excessive fragmentation.
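
(Rough arithmetic, using the sizes above: 9M extents * 32 bytes is ~290MB
of extent buffers; at 128 extents per 4k leaf that is ~70,000 leaves, and
one 16-byte indirection record per leaf gives a single ~1.1MB - roughly
270 pages - physically contiguous kmalloc for the indirection array.)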

On Mon, Jun 29, 2015 at 03:02:23PM -0400, Brian Foster wrote:
> On Mon, Jun 29, 2015 at 07:59:00PM +0200, Alex Lyakas wrote:
> > Hi Brian,
> > Thanks for your comments.
> > 
> > Here is the information you asked for:
> > 
> > meta-data=/dev/dm-147            isize=256    agcount=67, agsize=268435440
> > blks
> >         =                       sectsz=512   attr=2
> > data     =                       bsize=4096   blocks=17825792000, imaxpct=5
> >         =                       sunit=16     swidth=160 blks
> > naming   =version 2              bsize=4096   ascii-ci=0
> > log      =internal               bsize=4096   blocks=521728, version=2
> >         =                       sectsz=512   sunit=16 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > Mount options:
> > /dev/dm-147 /export/nfsvol xfs rw,sync,noatime,wsync,attr2,discard,inode64,allocsize=64k,logbsize=64k,sunit=128,swidth=1280,noquota
> > 0 0
> > 
> > So yes, we are using "allocsize=64k", which influences the speculative
> > allocation logic. I did various experiments, and indeed when I remove this
> > "allocsize=64k", fragmentation is much lesser. (Tried also other things,
> > like using a single nfsd thread, mounting without "sync" and patching nfsd
> > to provide "nicer" IOV to vfs_write, but none of these helped). On the other
> > hand, we started using this option "allocsize=64k" to prevent aggressive
> > preallocation that we saw XFS doing on large QCOW files (VM images).
> > 
> 
> What was the problem with regard to preallocation and large VM images?
> The preallocation is not permanent and should be cleaned up if the file
> is inactive for a period of time (see the other prealloc FAQ entries).

A lot of change went into the speculative preallocation in the
kernels after 3.8, so I suspect we've already fixed whatever problem
was seen. Alex, it would be a good idea to try to reproduce those
problems on a current kernel to see if they still are present....

> > Still, when doing local IO to a mounted XFS, even with "allocsize=64k", we
> > still get very few extents. Still don't know why there is this difference between
> > local IO and NFS. Would be great to receive a clue for that phenomenon.
> > 
> 
> What exactly is your test in this case? I assume you're also testing
> with the same mount options and whatnot. One difference could be that
> NFS might involve more open-write-close cycles than a local write test,
> which could impact reclaim of preallocation. For example, what happens
> if you run something like the following locally?
> 
> for i in $(seq 0 2 100); do
> 	xfs_io -fc "pwrite $((i * 4096)) 4k" /mnt/file
> done

That should produce similar results to running the NFS client. Years ago
back at SGI we used a tool written by Greg Banks called "ddnfs" for
testing this sort of thing. It did open_by_handle()/close() around
each read/write syscall to emulate the NFS server IO pattern.

http://oss.sgi.com/projects/nfs/testtools/ddnfs-oss-20090302.tar.bz2

> 
> This will do the strided writes while opening and closing the file each
> time and thus probably more closely matches what might be happening over
> NFS. Prealloc is typically trimmed on close, but there is an NFS
> specific heuristic that should detect this and let it hang around for
> longer in this case. Taking a quick look at that code shows that it is
> tied to the existence of delayed allocation blocks at close time,
> however. I suppose that might never trigger due to the sync mount
> option. What's the reason for using that one?

Right - it won't trigger because writeback occurs in the write()
context, so we have a clean inode when the fd is closed and
->release is called...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-06-29 22:26     ` Dave Chinner
@ 2015-07-06 18:47       ` Alex Lyakas
  2015-07-07  0:09         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2015-07-06 18:47 UTC (permalink / raw)
  To: Dave Chinner, bfoster; +Cc: Danny Shavit, Shyam Kaushik, Yair Hershko, xfs

Hi Dave, Brian,

[Compendium reply, trimmed and re-ordered]


> What was the problem with regard to preallocation and large VM images?
> The preallocation is not permanent and should be cleaned up if the file
> is inactive for a period of time (see the other prealloc FAQ entries).
The problem was that in 3.8 speculative preallocation was based on the inode 
size. So when creating large sparse files (for example with qemu-img), XFS 
was writing huge amounts of data through xfs_iozero, which choked the 
drives. As Dave pointed out, this was fixed in later kernels.

> For example, what happens
> if you run something like the following locally?
> for i in $(seq 0 2 100); do
> xfs_io -fc "pwrite $((i * 4096)) 4k" /mnt/file
> done
When running this locally, speculative preallocation is trimmed through 
xfs_free_eofblocks (verified with systemtap), and indeed we get a highly 
fragmented file.
However, debugging our NFS workload, we see that this is not happening, 
i.e., the NFS server does not issue ->release until the end of the workload.

> I suppose that might never trigger due to the sync mount
> option. What's the reason for using that one?

> I'm afraid to ask why, but that is likely your problem - synchronous
> out of order writes from the NFS client will fragment the file
> badly because they defeat both delayed allocation and speculative
> preallocation: there is nothing to trigger the "don't remove
> speculative prealloc on file close" heuristic used to avoid
> fragmentation caused by out of order NFS writes....
The main reason for using "sync" mount option is to avoid data loss in the 
case of crash.
I did some experiments without this mount option, and indeed I see that the same 
NFS workload results in lower fragmentation, especially for large files. 
However, since we are not considering removing the "sync" mount option at the 
moment, I did not debug further why it happens.

> NFS is likely resulting in out of order writes....
Yes, Dave, this appeared to be our issue. This was in addition to a badly 
configured NFS client, which had:
rsize=32768,wsize=32768
instead of what we usually see:
rsize=1048576,wsize=1048576

An out of order write was triggering a small speculative preallocation 
(allocsize=64k), and all subsequent writes into the "hole" were not able to 
benefit from it, and had to allocate separate extents (which most of the 
time were not physically contiguous). And the NFS server receiving 32k writes 
contributed even more to the fragmentation. With 1MB writes this problem 
doesn't really happen even with allocsize=64k.

So currently, we are pulling the following XFS patches:
xfs: don't use speculative prealloc for small files
xfs: fix xfs_iomap_eof_prealloc_initial_size type
xfs: increase prealloc size to double that of the previous extent
xfs: fix potential infinite loop in xfs_iomap_prealloc_size()
xfs: limit speculative prealloc size on sparse files

(Final code will be as usual in 
https://github.com/zadarastorage/zadara-xfs-pushback)

However, Dave, I am still not comfortable with XFS insisting on contiguous 
memory for the data fork in kmem_alloc. Consider, for example, Brian's 
script. Nothing stops the user from doing that. Another example could be 
strided 4k NFS writes coming out of order. For these cases, speculative 
preallocation will not help, as we will receive a highly fragmented file 
with holes.

As another example, Dave, can you please look at the stack trace in [1]? (It 
doesn't make much sense, but this is what we got.) Could something like this 
happen:
- VFS tells XFS to unlink an inode
- XFS tries to reallocate the extents fork via xfs_inactive path
- there is no contiguous memory, so the kernel (somehow) wants to evict the same 
inode, but cannot lock it due to XFS already holding the lock???
I know that this is very far-fetched, and probably wrong, but insisting on 
contiguous memory is also problematic here.

Thanks for your help Brian & Dave,
Alex.


[1]
454509.864025] nfsd            D 0000000000000001     0   797      2 
0x00000000
[454509.864025]  ffff88036e41d438 0000000000000046 ffff88037b351c00 
ffff88017fb22a20
[454509.864025]  ffff88036e41dfd8 0000000000000000 0000000000000008 
ffff8803aca2dd58
[454509.864025]  ffff88036e41d448 ffffffffa074905d 000000012e32b040 
ffff8803aca2dcc0
[454509.864025] Call Trace:
[454509.864025]  [<ffffffffa0748e94>] ? xfs_buf_lock+0x44/0x110 [xfs]
[454509.864025]  [<ffffffffa074905d>] ? _xfs_buf_find+0xfd/0x2a0 [xfs]
[454509.864025]  [<ffffffffa07492d4>] ? xfs_buf_get_map+0x34/0x1b0 [xfs]
[454509.864025]  [<ffffffffa074a261>] ? xfs_buf_read_map+0x31/0x130 [xfs]
[454509.864025]  [<ffffffffa07acc39>] ? xfs_trans_read_buf_map+0x2d9/0x490 
[xfs]
[454509.864025]  [<ffffffffa077e572>] ? 
xfs_btree_read_buf_block.isra.20.constprop.25+0x72/0xb0 [xfs]
[454509.864025]  [<ffffffffa0780a3c>] ? xfs_btree_rshift+0xcc/0x540 [xfs]
[454509.864025]  [<ffffffffa0749a84>] ? _xfs_buf_ioapply+0x294/0x300 [xfs]
[454509.864025]  [<ffffffffa0782bf8>] ? 
xfs_btree_make_block_unfull+0x58/0x190 [xfs]
[454509.864025]  [<ffffffffa074a210>] ? _xfs_buf_read+0x30/0x50 [xfs]
[454509.864025]  [<ffffffffa0749be9>] ? xfs_buf_iorequest+0x69/0xd0 [xfs]
[454509.864025]  [<ffffffffa07830b7>] ? xfs_btree_insrec+0x387/0x580 [xfs]
[454509.864025]  [<ffffffffa074a333>] ? xfs_buf_read_map+0x103/0x130 [xfs]
[454509.864025]  [<ffffffffa074a3bb>] ? xfs_buf_readahead_map+0x5b/0x80 
[xfs]
[454509.864025]  [<ffffffffa077e62b>] ? xfs_btree_lookup_get_block+0x7b/0xe0 
[xfs]
[454509.864025]  [<ffffffffa077d88f>] ? xfs_btree_ptr_offset+0x4f/0x70 [xfs]
[454509.864025]  [<ffffffffa077d8e2>] ? xfs_btree_key_addr+0x12/0x20 [xfs]
[454509.864025]  [<ffffffffa07822d7>] ? xfs_btree_lookup+0xb7/0x470 [xfs]
[454509.864025]  [<ffffffffa0764deb>] ? xfs_alloc_lookup_eq+0x1b/0x20 [xfs]
[454509.864025]  [<ffffffffa0765dd1>] ? xfs_free_ag_extent+0x421/0x940 [xfs]
[454509.864025]  [<ffffffffa07689fa>] ? xfs_free_extent+0x10a/0x170 [xfs]
[454509.864025]  [<ffffffffa07795c9>] ? xfs_bmap_finish+0x169/0x1b0 [xfs]
[454509.864025]  [<ffffffffa07956a3>] ? xfs_itruncate_extents+0xf3/0x2d0 
[xfs]
[454509.864025]  [<ffffffffa0764767>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
[454509.864025]  [<ffffffffa0762180>] ? xfs_inactive+0x340/0x450 [xfs]
[454509.864025]  [<ffffffff816ed725>] ? _raw_spin_lock_irq+0x15/0x20
[454509.864025]  [<ffffffffa075e303>] ? xfs_fs_evict_inode+0x93/0x100 [xfs]
[454509.864025]  [<ffffffff811b5530>] ? evict+0xc0/0x1d0
[454509.864025]  [<ffffffff811b5e62>] ? iput_final+0xe2/0x170
[454509.864025]  [<ffffffff811b5f2e>] ? iput+0x3e/0x50
[454509.864025]  [<ffffffff811b0e88>] ? dentry_unlink_inode+0xd8/0x110
[454509.864025]  [<ffffffff811b0f7e>] ? d_delete+0xbe/0xd0
[454509.864025]  [<ffffffff811a663e>] ? vfs_unlink.part.27+0xde/0xf0
[454509.864025]  [<ffffffff811a847c>] ? vfs_unlink+0x3c/0x60
[454509.864025]  [<ffffffffa01e90c3>] ? nfsd_unlink+0x183/0x230 [nfsd]
[454509.864025]  [<ffffffffa01f871d>] ? nfsd4_remove+0x6d/0x130 [nfsd]
[454509.864025]  [<ffffffffa01f746c>] ? nfsd4_proc_compound+0x5ac/0x7a0 
[nfsd]
[454509.864025]  [<ffffffffa01e2d62>] ? nfsd_dispatch+0x102/0x270 [nfsd]
[454509.864025]  [<ffffffffa013cb48>] ? svc_process_common+0x328/0x5e0 
[sunrpc]
[454509.864025]  [<ffffffffa013d153>] ? svc_process+0x103/0x160 [sunrpc]
[454509.864025]  [<ffffffffa01e272f>] ? nfsd+0xbf/0x130 [nfsd]
[454509.864025]  [<ffffffffa01e2670>] ? nfsd_destroy+0x80/0x80 [nfsd]
[454509.864025]  [<ffffffff8107f050>] ? kthread+0xc0/0xd0
[454509.864025]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
[454509.864025]  [<ffffffff816f61ec>] ? ret_from_fork+0x7c/0xb0
[454509.864025]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0







_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-07-06 18:47       ` Alex Lyakas
@ 2015-07-07  0:09         ` Dave Chinner
  2015-07-07  9:05           ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-07-07  0:09 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Danny Shavit, bfoster, Yair Hershko, Shyam Kaushik, xfs

On Mon, Jul 06, 2015 at 08:47:56PM +0200, Alex Lyakas wrote:
> Hi Dave, Brian,
> 
> [Compendium reply, trimmed and re-ordered]
> >I suppose that might never trigger due to the sync mount
> >option. What's the reason for using that one?
> 
> >I'm afraid to ask why, but that is likely your problem - synchronous
> >out of order writes from the NFS client will fragment the file
> >badly because they defeat both delayed allocation and speculative
> >preallocation: there is nothing to trigger the "don't remove
> >speculative prealloc on file close" heuristic used to avoid
> >fragmentation caused by out of order NFS writes....
> The main reason for using "sync" mount option is to avoid data loss
> in the case of crash.
> I did some experiments without this mount option, and indeed I see
> that same NFS workload results in lower fragmentation, especially
> for large files. However, since we do not consider at the moment
> removing the "sync" mount option, I did not debug further why it
> happens.

The NFS protocol handles server side data loss in the event of a
server crash. i.e. the client side commit is an "fsync" to the
server, and until the server responds with a success to the client
commit RPC the client side will continue to retry sending the data
to the server.

From the perspective of metadata (i.e. directory entries), the use of
the "dirsync" mount option is sufficient for HA failover servers to
work correctly as it ensures that directory structure changes are always
committed to disk before the RPC response is sent back to the
client.

i.e. the "sync" mount option doesn't actually improve data integrity
of an NFS server when you look at the end-to-end NFS protocol
handling of async write data....

> >NFS is likely resulting in out of order writes....
> Yes, Dave, this appeared to be our issue.

Ok, no big surprise that fragmentation is happening...

> However, Dave, I am still not comfortable with XFS insisting on
> contiguous memory for the data fork in kmem_alloc. Consider, for
> example,  Brian's script. Nothing stops the user from doing that.
> Another example could be strided 4k NFS writes coming out of order.
> For these cases, speculative preallocation will not help, as we will
> receive a highly fragmented file with holes.

Except that users and applications don't tend to do this because
other filesystems barf on such fragmented files long before XFS does.
Hence, in general, application and users take steps to avoid this
sort of braindead allocation.  And then, of course, there is
xfs_fsr...

We had this discussion ~10 years ago when this code was originally
written and it was decided that the complexity of implementing a
fully generic, scalable solution was not worth the effort as files
with massive numbers of extents cause other performance problems
long before memory allocation should be an issue.

That said, there is now generic infrastructure that makes the
complexity less of a problem, and I do have some patches that I've
been working on in the background to move to a generic structure.
Other filesystems like ext4, btrfs and f2fs have moved to extremely
fine-grained extent trees, but they have been causing all sorts of
interesting scalability and memory reclaim problems so I don't think
we want to go that way.

However, the problem we actually need to solve is not having a
fine-grained extent tree, but that of demand paging of in-memory
extent lists so that we don't need to keep millions of extents in
memory at a time.  That's where we need to go with this, and I have
some early, incomplete patches that move towards a btree based
structure for doing this....

> Another example, Dave, can you please look at the stack trace in
> [1]. (It doesn't make much sense, but this is what we got). Could
> something like this happen:
> - VFS tells XFS to unlink an inode
> - XFS tries to reallocate the extents fork via xfs_inactive path
> - there is no contiguous memory, so the kernel (somehow) wants to evict
> the same inode, but cannot lock it due to XFS already holding the
> lock???

Simply not possible. Memory reclaim can't evict an inode that has an
active reference count.  Also, unlinked inodes never get evicted
by memory reclaim - the final inode reference release will do the
reclaim, and that always occurs in a process context of some
kind....

> [454509.864025]  [<ffffffffa075e303>] ? xfs_fs_evict_inode+0x93/0x100 [xfs]
> [454509.864025]  [<ffffffff811b5530>] ? evict+0xc0/0x1d0
> [454509.864025]  [<ffffffff811b5e62>] ? iput_final+0xe2/0x170
> [454509.864025]  [<ffffffff811b5f2e>] ? iput+0x3e/0x50
> [454509.864025]  [<ffffffff811b0e88>] ? dentry_unlink_inode+0xd8/0x110
> [454509.864025]  [<ffffffff811b0f7e>] ? d_delete+0xbe/0xd0
> [454509.864025]  [<ffffffff811a663e>] ? vfs_unlink.part.27+0xde/0xf0
> [454509.864025]  [<ffffffff811a847c>] ? vfs_unlink+0x3c/0x60
> [454509.864025]  [<ffffffffa01e90c3>] ? nfsd_unlink+0x183/0x230 [nfsd]
> [454509.864025]  [<ffffffffa01f871d>] ? nfsd4_remove+0x6d/0x130 [nfsd]

As you can see here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-07-07  0:09         ` Dave Chinner
@ 2015-07-07  9:05           ` Christoph Hellwig
  2015-07-23 15:39             ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2015-07-07  9:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Shyam Kaushik, bfoster, Yair Hershko, xfs, Danny Shavit, Alex Lyakas

On Tue, Jul 07, 2015 at 10:09:11AM +1000, Dave Chinner wrote:
> server crash. i.e. the client side commit is an "fsync" to the
> server, and until the server responds with a success to the client
> commit RPC the client side will continue to retry sending the data
> to the server.
> 
> For the persepctive of metadata (i.e. directory entries) the use of
> the "dirsync" mount option is sufficient for HA failover servers to
> work correctly as it ensures that directory structure changes are always
> committed to disk before the RPC response is sent back to the
> client.
> 
> i.e. the "sync" mount option doesn't actually improve data integrity
> of an NFS server when you look at the end-to-end NFS protocol
> handling of async write data....


You don't need dirsync either.  NFS does the right sync using the
commit_metadata export operation without using that big hammer.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-07-07  9:05           ` Christoph Hellwig
@ 2015-07-23 15:39             ` Alex Lyakas
  2015-07-23 23:09               ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2015-07-23 15:39 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner
  Cc: Danny Shavit, bfoster, Yair Hershko, Shyam Kaushik, xfs

Hi Dave,
Just for completeness, XFS speculative preallocation (which is now based 
upon the size of the last extent) can still grow up to 4GB-8GB (depending on 
which patches we are pulling). As a result, xfs_iozero can still sometimes 
trigger 1-2GB writes of zeros in one shot. This turns out to be a bit 
unfriendly to the drives in some configurations. So we have applied a custom 
patch to limit the speculative preallocation to 32MB.

Final code will be in the same place.

Thanks,
Alex.


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-07-23 15:39             ` Alex Lyakas
@ 2015-07-23 23:09               ` Dave Chinner
  2016-04-05 18:10                 ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-07-23 23:09 UTC (permalink / raw)
  To: Alex Lyakas
  Cc: Danny Shavit, Shyam Kaushik, bfoster, Yair Hershko, xfs,
	Christoph Hellwig

On Thu, Jul 23, 2015 at 05:39:28PM +0200, Alex Lyakas wrote:
> Hi Dave,
> Just for completeness,  XFS speculative preallocation (that is based
> now upon the size of the last extent) can still grow up to 4Gb-8Gb
> (depending on which patches we are pulling). As a result, xfs_iozero
> can still sometimes trigger 1-2GB writes of zeros in one shot. This
> turns out to be a bit unfriendly to the drives in some
> configurations. So we have applied a custom patch to limit the
> speculative preallocation to 32Mb.

It would be much better to change xfs_zero_eof() to convert extents
beyond EOF to unwritten extents rather than zero them. That way you
still get the benefits of the large speculative prealloc but without
the zeroing overhead that weird sparse write patterns can trigger.

I just haven't got around to doing this because it hasn't been
reported as a significant problem for anyone until now.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2015-07-23 23:09               ` Dave Chinner
@ 2016-04-05 18:10                 ` Alex Lyakas
  2016-04-05 20:41                   ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2016-04-05 18:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Danny Shavit, Shyam Kaushik, bfoster, Yair Hershko, xfs,
	Christoph Hellwig

Hello Dave, Brian, Christoph,

We are still encountering cases in which different IO patterns beat XFS 
preallocation schemes, resulting in highly fragmented files having hundreds of 
thousands and sometimes millions of extents. In these cases XFS tries to 
allocate large arrays of xfs_ext_irec_t structures with kmalloc, and this 
often gets into numerous retries, and sometimes triggers a hung-task panic 
(due to some other thread wanting to access the same file).

We made a change to call kmem_zalloc_large, which resorts to __vmalloc in 
case kmalloc fails. kmem_free is already handling vmalloc addresses 
correctly. The change is only for the allocation done in 
xfs_iext_realloc_indirect, as this is the only place in which we have seen 
the issue.
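
To illustrate, here is only a sketch of the idea, not the exact patch we
ship (final code is in the git tree below); it assumes the 3.18-era
kmem_zalloc_large()/kmem_free() helpers from fs/xfs/kmem.c, which fall back
to __vmalloc()/vfree() for buffers kmalloc cannot satisfy:

/* Sketch: grow the indirection array via allocate-copy-free so the new
 * buffer may come from vmalloc space when a physically contiguous
 * kmalloc is not available. Handling of a failed allocation is left out
 * here; the real change has to decide whether to retry or propagate it. */
void
xfs_iext_realloc_indirect(
	xfs_ifork_t	*ifp,		/* inode fork pointer */
	int		new_size)	/* new indirection array size */
{
	int		nlists;		/* number of irec's (ex lists) */
	int		size;		/* current indirection array size */
	xfs_ext_irec_t	*new_irec;

	ASSERT(ifp->if_flags & XFS_IFEXTIREC);
	nlists = ifp->if_real_bytes / XFS_IEXT_BUFSZ;
	size = nlists * sizeof(xfs_ext_irec_t);
	ASSERT(ifp->if_real_bytes);
	ASSERT((new_size >= 0) && (new_size != size));
	if (new_size == 0) {
		xfs_iext_destroy(ifp);
		return;
	}

	/* kmem_zalloc_large() falls back to __vmalloc(); kmem_free()
	 * already copes with vmalloc'ed addresses */
	new_irec = kmem_zalloc_large(new_size, KM_NOFS);
	if (new_irec) {
		memcpy(new_irec, ifp->if_u1.if_ext_irec,
		       min(size, new_size));
		kmem_free(ifp->if_u1.if_ext_irec);
		ifp->if_u1.if_ext_irec = new_irec;
	}
}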

Final code, as usual, will be in 
https://github.com/zadarastorage/zadara-xfs-pushback

Thanks,
Alex.



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2016-04-05 18:10                 ` Alex Lyakas
@ 2016-04-05 20:41                   ` Dave Chinner
  2016-04-06 16:39                     ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2016-04-05 20:41 UTC (permalink / raw)
  To: Alex Lyakas
  Cc: Danny Shavit, Shyam Kaushik, bfoster, Yair Hershko, xfs,
	Christoph Hellwig

On Tue, Apr 05, 2016 at 09:10:06PM +0300, Alex Lyakas wrote:
> Hello Dave, Brian, Christoph,
> 
> We are still encountering cases, in which different IO patterns beat
> XFS preallocation schemes, resulting in highly fragmented files,
> having 100s of thousands and sometimes millions of extents. In these
> cases XFS tries to allocate large arrays of xfs_ext_irec_t structure
> with kmalloc, and this often gets into numerous retries, and
> sometimes triggers hung task panic (due to some other thread wanting
> to access the same file).
> 
> We made a change to call kmem_zalloc_large, which resorts to
> __vmalloc in case kmalloc fails. kmem_free is already handling
> vmalloc addresses correctly. The change is only for the allocation
> done in xfs_iext_realloc_indirect, as this is the only place, in
> which we have seen the issue.

As I've said before, vmalloc is not a solution we can use in
general.  32 bit systems have less vmalloc area than normal kernel
memory (e.g. ia32 has 128MB of vmalloc space vs 896MB of kernel
address space by default) and hence if we get large vmap allocation
requests for non-temporary, not directly reclaimable memory then
we'll end up with worse problems than we already have due to vmalloc
area exhaustion.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2016-04-05 20:41                   ` Dave Chinner
@ 2016-04-06 16:39                     ` Alex Lyakas
  2016-04-06 20:57                       ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2016-04-06 16:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Danny Shavit, Shyam Kaushik, bfoster, Yair Hershko, xfs,
	Christoph Hellwig

Hello Dave,

Thank you for your response. We understand your concerns and will do further 
testing with the vmalloc alternative.

Meanwhile, I found another issue happening when files have many extents. 
When unmounting XFS, we call
xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy
This goes over the whole indirection array and calls xfs_iext_irec_remove 
for each one of the erps (from the last one to the first one). As a result, 
we keep shrinking (reallocating actually) the indirection array until we 
shrink out all of its elements. When we have files with huge numbers of 
extents, umount takes 30-80 sec, depending on the amount of files that XFS 
loaded and the amount of indirection entries of each file. The unmount stack 
looks like [1].

The patch in [2] seems to address the issue. Do you think it is reasonable? 
It was tested only on kernel 3.18.19.

Thanks,
Alex.

[1]
[<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
[<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
[<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
[<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
[<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
[<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
[<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
[<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
[<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
[<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
[<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
[<ffffffff811ea207>] kill_block_super+0x27/0x70
[<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
[<ffffffff811eaaee>] deactivate_super+0x4e/0x70
[<ffffffff81207593>] cleanup_mnt+0x43/0x90
[<ffffffff81207632>] __cleanup_mnt+0x12/0x20
[<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
[<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
[<ffffffff81717c6f>] int_signal+0x12/0x17

[2]
--- /mnt/work/alex/tmp/code/prev_xfs2/fs/xfs/libxfs/xfs_inode_fork.c	2016-04-06 16:35:51.172255372 +0300
+++ fs/xfs/libxfs/xfs_inode_fork.c      2016-04-06 19:25:55.349593353 +0300
@@ -1499,34 +1499,48 @@
        kmem_free(ifp->if_u1.if_ext_irec);
        ifp->if_flags &= ~XFS_IFEXTIREC;
        ifp->if_u1.if_extents = ep;
        ifp->if_bytes = size;
        if (nextents < XFS_LINEAR_EXTS) {
                xfs_iext_realloc_direct(ifp, size);
        }
}

/*
+ * Remove all records from the indirection array.
+ */
+STATIC void
+xfs_iext_irec_remove_all(xfs_ifork_t *ifp)             /* inode fork pointer */
+{
+       int             nlists;         /* number of irec's (ex lists) */
+       int             i;              /* loop counter */
+
+       ASSERT(ifp->if_flags & XFS_IFEXTIREC);
+       nlists = ifp->if_real_bytes / XFS_IEXT_BUFSZ;
+       for (i = 0; i < nlists; i++) {
+               xfs_ext_irec_t *erp = &ifp->if_u1.if_ext_irec[i];
+               if (erp->er_extbuf)
+                       kmem_free(erp->er_extbuf);
+       }
+       kmem_free(ifp->if_u1.if_ext_irec);
+       ifp->if_real_bytes = 0;
+}
+
+/*
  * Free incore file extents.
  */
void
xfs_iext_destroy(
        xfs_ifork_t     *ifp)           /* inode fork pointer */
{
        if (ifp->if_flags & XFS_IFEXTIREC) {
-               int     erp_idx;
-               int     nlists;
-
-               nlists = ifp->if_real_bytes / XFS_IEXT_BUFSZ;
-               for (erp_idx = nlists - 1; erp_idx >= 0 ; erp_idx--) {
-                       xfs_iext_irec_remove(ifp, erp_idx);
-               }
+               xfs_iext_irec_remove_all(ifp);
                ifp->if_flags &= ~XFS_IFEXTIREC;
        } else if (ifp->if_real_bytes) {
                kmem_free(ifp->if_u1.if_extents);
        } else if (ifp->if_bytes) {
                memset(ifp->if_u2.if_inline_ext, 0, XFS_INLINE_EXTS *
                        sizeof(xfs_bmbt_rec_t));
        }
        ifp->if_u1.if_extents = NULL;
        ifp->if_real_bytes = 0;
        ifp->if_bytes = 0;



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock"
  2016-04-06 16:39                     ` Alex Lyakas
@ 2016-04-06 20:57                       ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2016-04-06 20:57 UTC (permalink / raw)
  To: Alex Lyakas
  Cc: Danny Shavit, Shyam Kaushik, bfoster, Yair Hershko, xfs,
	Christoph Hellwig

On Wed, Apr 06, 2016 at 07:39:21PM +0300, Alex Lyakas wrote:
> Hello Dave,
> 
> Thank you for your response. We understand your concerns and will do
> further testing with the vmalloc alternative.
> 
> Meanwhile, I found another issue happening when files have many
> extents. When unmounting XFS, we call
> xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy
> This goes over the whole indirection array and calls
> xfs_iext_irec_remove for each one of the erps (from the last one to
> the first one). As a result, we keep shrinking (reallocating
> actually) the indirection array until we shrink out all of its
> elements. When we have files with huge numbers of extents, umount
> takes 30-80 sec, depending on the amount of files that XFS loaded
> and the amount of indirection entries of each file. The unmount
> stack looks like [1].
> 
> That patch in [2] seems to address the issue. Do you think it is
> reasonable? It was tested only on kernel 3.18.19.

Looks like a good optimisation to make. Can you send it as a proper
patch (separate thread) with this description and your signed-off-by
on it?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-04-06 20:58 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-23  8:18 xfs_iext_realloc_indirect and "XFS: possible memory allocation deadlock" Alex Lyakas
2015-06-23 20:18 ` Dave Chinner
2015-06-27 21:01   ` Alex Lyakas
2015-06-28 18:19     ` Alex Lyakas
2015-06-29 11:43     ` Brian Foster
2015-06-29 17:59       ` Alex Lyakas
2015-06-29 19:02         ` Brian Foster
2015-06-29 22:26     ` Dave Chinner
2015-07-06 18:47       ` Alex Lyakas
2015-07-07  0:09         ` Dave Chinner
2015-07-07  9:05           ` Christoph Hellwig
2015-07-23 15:39             ` Alex Lyakas
2015-07-23 23:09               ` Dave Chinner
2016-04-05 18:10                 ` Alex Lyakas
2016-04-05 20:41                   ` Dave Chinner
2016-04-06 16:39                     ` Alex Lyakas
2016-04-06 20:57                       ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.