* Failing XFS memory allocation
@ 2016-03-23 10:15 Nikolay Borisov
  2016-03-23 12:43 ` Brian Foster
  2016-03-24  9:33 ` Christoph Hellwig
  0 siblings, 2 replies; 13+ messages in thread
From: Nikolay Borisov @ 2016-03-23 10:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hello,

So I have an XFS filesystem which houses two 2.3T sparse files, which are
loop-mounted. Recently I migrated a server to a 4.4.6 kernel, and this
morning I observed the following in dmesg:

XFS: loop0(15174) possible memory allocation deadlock size 107168 in
kmem_alloc (mode:0x2400240)

the mode is essentially (GFP_KERNEL | __GFP_NOWARN) & ~__GFP_FS.
Here is the size of the loop file in case it matters:

du -h --apparent-size /storage/loop/file1
2.3T	/storage/loop/file1

du -h /storage/loop/file1
878G	/storage/loop/file1

This message is repeated multiple times. Looking at the output of
"echo w > /proc/sysrq-trigger" I see the following suspicious entry:

loop0           D ffff881fe081f038     0 15174      2 0x00000000
 ffff881fe081f038 ffff883ff29fa700 ffff881fecb70d00 ffff88407fffae00
 0000000000000000 0000000502404240 ffffffff81e30d60 0000000000000000
 0000000000000000 ffff881f00000003 0000000000000282 ffff883f00000000
Call Trace:
 [<ffffffff8163ac01>] ? _raw_spin_lock_irqsave+0x21/0x60
 [<ffffffff81636fd7>] schedule+0x47/0x90
 [<ffffffff81639f03>] schedule_timeout+0x113/0x1e0
 [<ffffffff810ac580>] ? lock_timer_base+0x80/0x80
 [<ffffffff816363d4>] io_schedule_timeout+0xa4/0x110
 [<ffffffff8114aadf>] congestion_wait+0x7f/0x130
 [<ffffffff810939e0>] ? woken_wake_function+0x20/0x20
 [<ffffffffa0283bac>] kmem_alloc+0x8c/0x120 [xfs]
 [<ffffffff81181751>] ? __kmalloc+0x121/0x250
 [<ffffffffa0283c73>] kmem_realloc+0x33/0x80 [xfs]
 [<ffffffffa02546cd>] xfs_iext_realloc_indirect+0x3d/0x60 [xfs]
 [<ffffffffa02548cf>] xfs_iext_irec_new+0x3f/0xf0 [xfs]
 [<ffffffffa0254c0d>] xfs_iext_add_indirect_multi+0x14d/0x210 [xfs]
 [<ffffffffa02554b5>] xfs_iext_add+0xc5/0x230 [xfs]
 [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffffa0256269>] xfs_iext_insert+0x59/0x110 [xfs]
 [<ffffffffa0230928>] ? xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
 [<ffffffffa0230928>] xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
 [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff8112b725>] ? mempool_alloc+0x65/0x180
 [<ffffffffa02543d8>] ? xfs_iext_get_ext+0x38/0x70 [xfs]
 [<ffffffffa0254e8d>] ? xfs_iext_bno_to_ext+0xed/0x150 [xfs]
 [<ffffffffa02311b5>] xfs_bmapi_reserve_delalloc+0x225/0x250 [xfs]
 [<ffffffffa023131e>] xfs_bmapi_delay+0x13e/0x290 [xfs]
 [<ffffffffa02730ad>] xfs_iomap_write_delay+0x17d/0x300 [xfs]
 [<ffffffffa022e434>] ? xfs_bmapi_read+0x114/0x330 [xfs]
 [<ffffffffa025ddc5>] __xfs_get_blocks+0x585/0xa90 [xfs]
 [<ffffffff81324b53>] ? __percpu_counter_add+0x63/0x80
 [<ffffffff811374cd>] ? account_page_dirtied+0xed/0x1b0
 [<ffffffff811cfc59>] ? alloc_buffer_head+0x49/0x60
 [<ffffffff811d07c0>] ? alloc_page_buffers+0x60/0xb0
 [<ffffffff811d13e5>] ? create_empty_buffers+0x45/0xc0
 [<ffffffffa025e324>] xfs_get_blocks+0x14/0x20 [xfs]
 [<ffffffff811d34e2>] __block_write_begin+0x1c2/0x580
 [<ffffffffa025e310>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
 [<ffffffffa025bbb1>] xfs_vm_write_begin+0x61/0xf0 [xfs]
 [<ffffffff81127e50>] generic_perform_write+0xd0/0x1f0
 [<ffffffffa026a341>] xfs_file_buffered_aio_write+0xe1/0x240 [xfs]
 [<ffffffff812e16d2>] ? bt_clear_tag+0xb2/0xd0
 [<ffffffffa026ab87>] xfs_file_write_iter+0x167/0x170 [xfs]
 [<ffffffff81199d76>] vfs_iter_write+0x76/0xa0
 [<ffffffffa03fb735>] lo_write_bvec+0x65/0x100 [loop]
 [<ffffffffa03fd589>] loop_queue_work+0x689/0x924 [loop]
 [<ffffffff8163ba52>] ? retint_kernel+0x10/0x10
 [<ffffffff81074d71>] kthread_worker_fn+0x61/0x1c0
 [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
 [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
 [<ffffffff810744d7>] kthread+0xd7/0xf0
 [<ffffffff8107d22e>] ? schedule_tail+0x1e/0xd0
 [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
 [<ffffffff8163b2af>] ret_from_fork+0x3f/0x70
 [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80

So it seems that writes to the loop device are being queued, and while
they are served XFS has to do some internal memory allocation to fit
the new data; however, for some *unknown* reason the allocation fails
and kmem_alloc starts looping. I didn't see any OOM reports, so
presumably the server was not out of memory, but unfortunately I didn't
check the memory fragmentation. I did collect a crash dump in case you
need further info.
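
For reference, the message comes from the retry loop in fs/xfs/kmem.c;
the logic is roughly the following (a paraphrased sketch of the 4.4-era
code, not an exact copy):

void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
        int     retries = 0;
        gfp_t   lflags = kmem_flags_convert(flags);
        void    *ptr;

        do {
                ptr = kmalloc(size, lflags);
                if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
                        return ptr;     /* got memory, or caller tolerates failure */
                if (!(++retries % 100)) /* warn every 100 failed attempts */
                        xfs_err(NULL,
        "%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
                                current->comm, current->pid,
                                (unsigned int)size, __func__, lflags);
                /* back off while writeback makes progress, then retry forever */
                congestion_wait(BLK_RW_ASYNC, HZ/50);
        } while (1);
}

So with KM_SLEEP semantics the allocation never fails outright; it just
keeps retrying and logging the warning every 100 attempts, which matches
the repeated dmesg lines and the congestion_wait() frame in the stack.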

The one thing which bugs me is that XFS tried to allocate ~107KB of
contiguous memory (about 26 pages, so roughly an order-5 request once
kmalloc rounds it up). Isn't this way too big and almost never
satisfiable, even with direct/background reclaim enabled? For now I've
reverted to a 3.12.52 kernel, where this issue hasn't been observed
(yet). Any ideas would be much appreciated.


* Re: Failing XFS memory allocation
  2016-03-23 10:15 Failing XFS memory allocation Nikolay Borisov
@ 2016-03-23 12:43 ` Brian Foster
  2016-03-23 12:56   ` Nikolay Borisov
  2016-03-24  9:33 ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Brian Foster @ 2016-03-23 12:43 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: xfs

On Wed, Mar 23, 2016 at 12:15:42PM +0200, Nikolay Borisov wrote:
> Hello,
> 
> So I have an XFS filesystem which houses 2 2.3T sparse files, which are
> loop-mounted. Recently I migrated a server to a 4.4.6 kernel and this
> morning I observed the following in my dmesg:
> 
> XFS: loop0(15174) possible memory allocation deadlock size 107168 in
> kmem_alloc (mode:0x2400240)
> 

Is there a stack trace associated with this message?

> the mode is essentially (GFP_KERNEL | GFP_NOWARN) &= ~__GFP_FS.
> Here is the site of the loop file in case it matters:
> 
> du -h --apparent-size /storage/loop/file1
> 2.3T	/storage/loop/file1
> 
> du -h /storage/loop/file1
> 878G	/storage/loop/file1
> 
> And this string is repeated multiple times. Looking at the output of
> "echo w > /proc/sysrq-trigger" I see the following suspicious entry:
> 
> loop0           D ffff881fe081f038     0 15174      2 0x00000000
>  ffff881fe081f038 ffff883ff29fa700 ffff881fecb70d00 ffff88407fffae00
>  0000000000000000 0000000502404240 ffffffff81e30d60 0000000000000000
>  0000000000000000 ffff881f00000003 0000000000000282 ffff883f00000000
> Call Trace:
>  [<ffffffff8163ac01>] ? _raw_spin_lock_irqsave+0x21/0x60
>  [<ffffffff81636fd7>] schedule+0x47/0x90
>  [<ffffffff81639f03>] schedule_timeout+0x113/0x1e0
>  [<ffffffff810ac580>] ? lock_timer_base+0x80/0x80
>  [<ffffffff816363d4>] io_schedule_timeout+0xa4/0x110
>  [<ffffffff8114aadf>] congestion_wait+0x7f/0x130
>  [<ffffffff810939e0>] ? woken_wake_function+0x20/0x20
>  [<ffffffffa0283bac>] kmem_alloc+0x8c/0x120 [xfs]
>  [<ffffffff81181751>] ? __kmalloc+0x121/0x250
>  [<ffffffffa0283c73>] kmem_realloc+0x33/0x80 [xfs]
>  [<ffffffffa02546cd>] xfs_iext_realloc_indirect+0x3d/0x60 [xfs]
>  [<ffffffffa02548cf>] xfs_iext_irec_new+0x3f/0xf0 [xfs]
>  [<ffffffffa0254c0d>] xfs_iext_add_indirect_multi+0x14d/0x210 [xfs]
>  [<ffffffffa02554b5>] xfs_iext_add+0xc5/0x230 [xfs]

It looks like it's working to add a new extent to the in-core extent
list. If this is the stack associated with the warning message (combined
with the large alloc size), I wonder if there's a fragmentation issue on
the file leading to an excessive number of extents.

What does 'xfs_bmap -v /storage/loop/file1' show?

Brian

>  [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
>  [<ffffffffa0256269>] xfs_iext_insert+0x59/0x110 [xfs]
>  [<ffffffffa0230928>] ? xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
>  [<ffffffffa0230928>] xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
>  [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
>  [<ffffffff8112b725>] ? mempool_alloc+0x65/0x180
>  [<ffffffffa02543d8>] ? xfs_iext_get_ext+0x38/0x70 [xfs]
>  [<ffffffffa0254e8d>] ? xfs_iext_bno_to_ext+0xed/0x150 [xfs]
>  [<ffffffffa02311b5>] xfs_bmapi_reserve_delalloc+0x225/0x250 [xfs]
>  [<ffffffffa023131e>] xfs_bmapi_delay+0x13e/0x290 [xfs]
>  [<ffffffffa02730ad>] xfs_iomap_write_delay+0x17d/0x300 [xfs]
>  [<ffffffffa022e434>] ? xfs_bmapi_read+0x114/0x330 [xfs]
>  [<ffffffffa025ddc5>] __xfs_get_blocks+0x585/0xa90 [xfs]
>  [<ffffffff81324b53>] ? __percpu_counter_add+0x63/0x80
>  [<ffffffff811374cd>] ? account_page_dirtied+0xed/0x1b0
>  [<ffffffff811cfc59>] ? alloc_buffer_head+0x49/0x60
>  [<ffffffff811d07c0>] ? alloc_page_buffers+0x60/0xb0
>  [<ffffffff811d13e5>] ? create_empty_buffers+0x45/0xc0
>  [<ffffffffa025e324>] xfs_get_blocks+0x14/0x20 [xfs]
>  [<ffffffff811d34e2>] __block_write_begin+0x1c2/0x580
>  [<ffffffffa025e310>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
>  [<ffffffffa025bbb1>] xfs_vm_write_begin+0x61/0xf0 [xfs]
>  [<ffffffff81127e50>] generic_perform_write+0xd0/0x1f0
>  [<ffffffffa026a341>] xfs_file_buffered_aio_write+0xe1/0x240 [xfs]
>  [<ffffffff812e16d2>] ? bt_clear_tag+0xb2/0xd0
>  [<ffffffffa026ab87>] xfs_file_write_iter+0x167/0x170 [xfs]
>  [<ffffffff81199d76>] vfs_iter_write+0x76/0xa0
>  [<ffffffffa03fb735>] lo_write_bvec+0x65/0x100 [loop]
>  [<ffffffffa03fd589>] loop_queue_work+0x689/0x924 [loop]
>  [<ffffffff8163ba52>] ? retint_kernel+0x10/0x10
>  [<ffffffff81074d71>] kthread_worker_fn+0x61/0x1c0
>  [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
>  [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
>  [<ffffffff810744d7>] kthread+0xd7/0xf0
>  [<ffffffff8107d22e>] ? schedule_tail+0x1e/0xd0
>  [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
>  [<ffffffff8163b2af>] ret_from_fork+0x3f/0x70
>  [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
> 
> So this seems that there are writes to the loop device being queued and
> while being served XFS has to do some internal memory allocation to fit
> the new data, however due to some *uknown* reason it fails and starts
> looping in kmem_alloc.  I didn't see any OOM reports so presumably the
> server was not out of memory, but unfortunately I didn't check the
> memory fragmentation, though I collected a crash dump in case you need
> further info.
> 
> The one thing which bugs me is that XFS tried to allocate 107 contiguous
> kb which is page-order-26 isn't this waaaaay too big and almost never
> satisfiable, despite direct/bg reclaim to be enabled? For now I've
> reverted to using 3.12.52 kernel, where this issue hasn't been observed
> (yet) any ideas would be much appreciated.
> 

* Re: Failing XFS memory allocation
  2016-03-23 12:43 ` Brian Foster
@ 2016-03-23 12:56   ` Nikolay Borisov
  2016-03-23 13:10     ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Nikolay Borisov @ 2016-03-23 12:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs



On 03/23/2016 02:43 PM, Brian Foster wrote:
> On Wed, Mar 23, 2016 at 12:15:42PM +0200, Nikolay Borisov wrote:
>> Hello,
>>
>> So I have an XFS filesystem which houses 2 2.3T sparse files, which are
>> loop-mounted. Recently I migrated a server to a 4.4.6 kernel and this
>> morning I observed the following in my dmesg:
>>
>> XFS: loop0(15174) possible memory allocation deadlock size 107168 in
>> kmem_alloc (mode:0x2400240)
>>
> 
> Is there a stack trace associated with this message?
> 
>> the mode is essentially (GFP_KERNEL | GFP_NOWARN) &= ~__GFP_FS.
>> Here is the site of the loop file in case it matters:
>>
>> du -h --apparent-size /storage/loop/file1
>> 2.3T	/storage/loop/file1
>>
>> du -h /storage/loop/file1
>> 878G	/storage/loop/file1
>>
>> And this string is repeated multiple times. Looking at the output of
>> "echo w > /proc/sysrq-trigger" I see the following suspicious entry:
>>
>> loop0           D ffff881fe081f038     0 15174      2 0x00000000
>>  ffff881fe081f038 ffff883ff29fa700 ffff881fecb70d00 ffff88407fffae00
>>  0000000000000000 0000000502404240 ffffffff81e30d60 0000000000000000
>>  0000000000000000 ffff881f00000003 0000000000000282 ffff883f00000000
>> Call Trace:
>>  [<ffffffff8163ac01>] ? _raw_spin_lock_irqsave+0x21/0x60
>>  [<ffffffff81636fd7>] schedule+0x47/0x90
>>  [<ffffffff81639f03>] schedule_timeout+0x113/0x1e0
>>  [<ffffffff810ac580>] ? lock_timer_base+0x80/0x80
>>  [<ffffffff816363d4>] io_schedule_timeout+0xa4/0x110
>>  [<ffffffff8114aadf>] congestion_wait+0x7f/0x130
>>  [<ffffffff810939e0>] ? woken_wake_function+0x20/0x20
>>  [<ffffffffa0283bac>] kmem_alloc+0x8c/0x120 [xfs]
>>  [<ffffffff81181751>] ? __kmalloc+0x121/0x250
>>  [<ffffffffa0283c73>] kmem_realloc+0x33/0x80 [xfs]
>>  [<ffffffffa02546cd>] xfs_iext_realloc_indirect+0x3d/0x60 [xfs]
>>  [<ffffffffa02548cf>] xfs_iext_irec_new+0x3f/0xf0 [xfs]
>>  [<ffffffffa0254c0d>] xfs_iext_add_indirect_multi+0x14d/0x210 [xfs]
>>  [<ffffffffa02554b5>] xfs_iext_add+0xc5/0x230 [xfs]
> 
> It looks like it's working to add a new extent to the in-core extent
> list. If this is the stack associated with the warning message (combined
> with the large alloc size), I wonder if there's a fragmentation issue on
> the file leading to an excessive number of extents.

Yes, this is the stack trace associated with the warning.

> 
> What does 'xfs_bmap -v /storage/loop/file1' show?

It spews a lot of output, but here is a summary; more detailed info can
be provided if you need it:

xfs_bmap -v /storage/loop/file1 | wc -l
900908
xfs_bmap -v /storage/loop/file1 | grep -c hole
94568

Also, what would constitute an "excessive number of extents"?

> 
> Brian
> 
>>  [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
>>  [<ffffffffa0256269>] xfs_iext_insert+0x59/0x110 [xfs]
>>  [<ffffffffa0230928>] ? xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
>>  [<ffffffffa0230928>] xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
>>  [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
>>  [<ffffffff8112b725>] ? mempool_alloc+0x65/0x180
>>  [<ffffffffa02543d8>] ? xfs_iext_get_ext+0x38/0x70 [xfs]
>>  [<ffffffffa0254e8d>] ? xfs_iext_bno_to_ext+0xed/0x150 [xfs]
>>  [<ffffffffa02311b5>] xfs_bmapi_reserve_delalloc+0x225/0x250 [xfs]
>>  [<ffffffffa023131e>] xfs_bmapi_delay+0x13e/0x290 [xfs]
>>  [<ffffffffa02730ad>] xfs_iomap_write_delay+0x17d/0x300 [xfs]
>>  [<ffffffffa022e434>] ? xfs_bmapi_read+0x114/0x330 [xfs]
>>  [<ffffffffa025ddc5>] __xfs_get_blocks+0x585/0xa90 [xfs]
>>  [<ffffffff81324b53>] ? __percpu_counter_add+0x63/0x80
>>  [<ffffffff811374cd>] ? account_page_dirtied+0xed/0x1b0
>>  [<ffffffff811cfc59>] ? alloc_buffer_head+0x49/0x60
>>  [<ffffffff811d07c0>] ? alloc_page_buffers+0x60/0xb0
>>  [<ffffffff811d13e5>] ? create_empty_buffers+0x45/0xc0
>>  [<ffffffffa025e324>] xfs_get_blocks+0x14/0x20 [xfs]
>>  [<ffffffff811d34e2>] __block_write_begin+0x1c2/0x580
>>  [<ffffffffa025e310>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
>>  [<ffffffffa025bbb1>] xfs_vm_write_begin+0x61/0xf0 [xfs]
>>  [<ffffffff81127e50>] generic_perform_write+0xd0/0x1f0
>>  [<ffffffffa026a341>] xfs_file_buffered_aio_write+0xe1/0x240 [xfs]
>>  [<ffffffff812e16d2>] ? bt_clear_tag+0xb2/0xd0
>>  [<ffffffffa026ab87>] xfs_file_write_iter+0x167/0x170 [xfs]
>>  [<ffffffff81199d76>] vfs_iter_write+0x76/0xa0
>>  [<ffffffffa03fb735>] lo_write_bvec+0x65/0x100 [loop]
>>  [<ffffffffa03fd589>] loop_queue_work+0x689/0x924 [loop]
>>  [<ffffffff8163ba52>] ? retint_kernel+0x10/0x10
>>  [<ffffffff81074d71>] kthread_worker_fn+0x61/0x1c0
>>  [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
>>  [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
>>  [<ffffffff810744d7>] kthread+0xd7/0xf0
>>  [<ffffffff8107d22e>] ? schedule_tail+0x1e/0xd0
>>  [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
>>  [<ffffffff8163b2af>] ret_from_fork+0x3f/0x70
>>  [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
>>
>> So this seems that there are writes to the loop device being queued and
>> while being served XFS has to do some internal memory allocation to fit
>> the new data, however due to some *uknown* reason it fails and starts
>> looping in kmem_alloc.  I didn't see any OOM reports so presumably the
>> server was not out of memory, but unfortunately I didn't check the
>> memory fragmentation, though I collected a crash dump in case you need
>> further info.
>>
>> The one thing which bugs me is that XFS tried to allocate 107 contiguous
>> kb which is page-order-26 isn't this waaaaay too big and almost never
>> satisfiable, despite direct/bg reclaim to be enabled? For now I've
>> reverted to using 3.12.52 kernel, where this issue hasn't been observed
>> (yet) any ideas would be much appreciated.
>>

* Re: Failing XFS memory allocation
  2016-03-23 12:56   ` Nikolay Borisov
@ 2016-03-23 13:10     ` Brian Foster
  2016-03-23 15:03       ` Nikolay Borisov
  2016-03-23 23:00       ` Dave Chinner
  0 siblings, 2 replies; 13+ messages in thread
From: Brian Foster @ 2016-03-23 13:10 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: xfs

On Wed, Mar 23, 2016 at 02:56:25PM +0200, Nikolay Borisov wrote:
> 
> 
> On 03/23/2016 02:43 PM, Brian Foster wrote:
> > On Wed, Mar 23, 2016 at 12:15:42PM +0200, Nikolay Borisov wrote:
...
> > It looks like it's working to add a new extent to the in-core extent
> > list. If this is the stack associated with the warning message (combined
> > with the large alloc size), I wonder if there's a fragmentation issue on
> > the file leading to an excessive number of extents.
> 
> Yes this is the stack trace associated.
> 
> > 
> > What does 'xfs_bmap -v /storage/loop/file1' show?
> 
> It spews a lot of stuff but here is a summary, more detailed info can be
> provided if you need it:
> 
> xfs_bmap -v /storage/loop/file1 | wc -l
> 900908
> xfs_bmap -v /storage/loop/file1 | grep -c hole
> 94568
> 
> Also, what would constitute an "excessive number of extents"?
> 

I'm not sure where one would draw the line, tbh; it's just a matter of
having so many extents that they cause problems, either with performance
(i.e., reading/modifying the extent list) or with allocations like the
one you're running into. As it is, XFS maintains the full extent list
for an active inode in memory, so that's 800k+ extents it has to find
memory for.

It looks like that is your problem here. 800k or so extents over 878G
looks to be about 1MB per extent. Are you using extent size hints? One
option that might prevent this is to use a larger extent size hint
value. Another might be to preallocate the entire file up front with
fallocate. You'd probably have to experiment with what option or value
works best for your workload.
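
For example (the numbers here are only illustrative; what's appropriate
depends entirely on the workload), the hint can be set with xfs_io. Note
that XFS generally only lets you change the extent size hint on a file
that has no extents allocated yet, so for an existing image the more
practical route is to set an inheritable hint on the directory and/or
preallocate when the file is (re)created:

xfs_io -c "extsize 1m" /storage/loop             # new files in this dir inherit a 1MiB hint
xfs_io -c "extsize" /storage/loop/file1          # show the current hint on the file
xfs_io -c "falloc 0 2300g" /storage/loop/file1   # preallocate the full apparent size up front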

Brian

> > 
> > Brian
> > 
> >>  [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
> >>  [<ffffffffa0256269>] xfs_iext_insert+0x59/0x110 [xfs]
> >>  [<ffffffffa0230928>] ? xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
> >>  [<ffffffffa0230928>] xfs_bmap_add_extent_hole_delay+0xd8/0x740 [xfs]
> >>  [<ffffffff8112b5c5>] ? mempool_alloc_slab+0x15/0x20
> >>  [<ffffffff8112b725>] ? mempool_alloc+0x65/0x180
> >>  [<ffffffffa02543d8>] ? xfs_iext_get_ext+0x38/0x70 [xfs]
> >>  [<ffffffffa0254e8d>] ? xfs_iext_bno_to_ext+0xed/0x150 [xfs]
> >>  [<ffffffffa02311b5>] xfs_bmapi_reserve_delalloc+0x225/0x250 [xfs]
> >>  [<ffffffffa023131e>] xfs_bmapi_delay+0x13e/0x290 [xfs]
> >>  [<ffffffffa02730ad>] xfs_iomap_write_delay+0x17d/0x300 [xfs]
> >>  [<ffffffffa022e434>] ? xfs_bmapi_read+0x114/0x330 [xfs]
> >>  [<ffffffffa025ddc5>] __xfs_get_blocks+0x585/0xa90 [xfs]
> >>  [<ffffffff81324b53>] ? __percpu_counter_add+0x63/0x80
> >>  [<ffffffff811374cd>] ? account_page_dirtied+0xed/0x1b0
> >>  [<ffffffff811cfc59>] ? alloc_buffer_head+0x49/0x60
> >>  [<ffffffff811d07c0>] ? alloc_page_buffers+0x60/0xb0
> >>  [<ffffffff811d13e5>] ? create_empty_buffers+0x45/0xc0
> >>  [<ffffffffa025e324>] xfs_get_blocks+0x14/0x20 [xfs]
> >>  [<ffffffff811d34e2>] __block_write_begin+0x1c2/0x580
> >>  [<ffffffffa025e310>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
> >>  [<ffffffffa025bbb1>] xfs_vm_write_begin+0x61/0xf0 [xfs]
> >>  [<ffffffff81127e50>] generic_perform_write+0xd0/0x1f0
> >>  [<ffffffffa026a341>] xfs_file_buffered_aio_write+0xe1/0x240 [xfs]
> >>  [<ffffffff812e16d2>] ? bt_clear_tag+0xb2/0xd0
> >>  [<ffffffffa026ab87>] xfs_file_write_iter+0x167/0x170 [xfs]
> >>  [<ffffffff81199d76>] vfs_iter_write+0x76/0xa0
> >>  [<ffffffffa03fb735>] lo_write_bvec+0x65/0x100 [loop]
> >>  [<ffffffffa03fd589>] loop_queue_work+0x689/0x924 [loop]
> >>  [<ffffffff8163ba52>] ? retint_kernel+0x10/0x10
> >>  [<ffffffff81074d71>] kthread_worker_fn+0x61/0x1c0
> >>  [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
> >>  [<ffffffff81074d10>] ? flush_kthread_work+0x120/0x120
> >>  [<ffffffff810744d7>] kthread+0xd7/0xf0
> >>  [<ffffffff8107d22e>] ? schedule_tail+0x1e/0xd0
> >>  [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
> >>  [<ffffffff8163b2af>] ret_from_fork+0x3f/0x70
> >>  [<ffffffff81074400>] ? kthread_freezable_should_stop+0x80/0x80
> >>
> >> So this seems that there are writes to the loop device being queued and
> >> while being served XFS has to do some internal memory allocation to fit
> >> the new data, however due to some *uknown* reason it fails and starts
> >> looping in kmem_alloc.  I didn't see any OOM reports so presumably the
> >> server was not out of memory, but unfortunately I didn't check the
> >> memory fragmentation, though I collected a crash dump in case you need
> >> further info.
> >>
> >> The one thing which bugs me is that XFS tried to allocate 107 contiguous
> >> kb which is page-order-26 isn't this waaaaay too big and almost never
> >> satisfiable, despite direct/bg reclaim to be enabled? For now I've
> >> reverted to using 3.12.52 kernel, where this issue hasn't been observed
> >> (yet) any ideas would be much appreciated.
> >>

* Re: Failing XFS memory allocation
  2016-03-23 13:10     ` Brian Foster
@ 2016-03-23 15:03       ` Nikolay Borisov
  2016-03-23 16:58         ` Brian Foster
  2016-03-23 23:00       ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Nikolay Borisov @ 2016-03-23 15:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs



On 03/23/2016 03:10 PM, Brian Foster wrote:
> On Wed, Mar 23, 2016 at 02:56:25PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 03/23/2016 02:43 PM, Brian Foster wrote:
>>> On Wed, Mar 23, 2016 at 12:15:42PM +0200, Nikolay Borisov wrote:
> ...
>>> It looks like it's working to add a new extent to the in-core extent
>>> list. If this is the stack associated with the warning message (combined
>>> with the large alloc size), I wonder if there's a fragmentation issue on
>>> the file leading to an excessive number of extents.
>>
>> Yes this is the stack trace associated.
>>
>>>
>>> What does 'xfs_bmap -v /storage/loop/file1' show?
>>
>> It spews a lot of stuff but here is a summary, more detailed info can be
>> provided if you need it:
>>
>> xfs_bmap -v /storage/loop/file1 | wc -l
>> 900908
>> xfs_bmap -v /storage/loop/file1 | grep -c hole
>> 94568
>>
>> Also, what would constitute an "excessive number of extents"?
>>
> 
> I'm not sure where one would draw the line tbh, it's just a matter of
> having too many extents to the point that it causes problems in terms of
> performance (i.e., reading/modifying the extent list) or such as the
> allocation problem you're running into. As it is, XFS maintains the full
> extent list for an active inode in memory, so that's 800k+ extents that
> it's looking for memory for.

I saw in the comments that this problem has already been identified and
that a possible solution would be to add another level of indirection.
Also, can you confirm that my understanding of the indirection array is
correct, namely that each entry in the indirection array (xfs_ext_irec)
is responsible for up to 256 extents? (er_extbuf is one page, 4KB, and
an extent record is 16 bytes, which gives 256 extents.)

> 
> It looks like that is your problem here. 800k or so extents over 878G
> looks to be about 1MB per extent. Are you using extent size hints? One
> option that might prevent this is to use a larger extent size hint
> value. Another might be to preallocate the entire file up front with
> fallocate. You'd probably have to experiment with what option or value
> works best for your workload.

By preallocating with fallocate you mean using fallocate with
FALLOC_FL_ZERO_RANGE and not FALLOC_FL_PUNCH_HOLE, right? Because as it
stands the file does have holes, which presumably are being filled, and
in order to be filled an extent has to be allocated, which is what
caused the issue? Am I right in this reasoning?

Currently I'm not using extent size hints but will look into that.
Also, if the extent size hint is, say, 4MB, wouldn't that cause a fairly
serious loss of space when the writes are smaller than 4MB? Would XFS
try to perform some sort of extent coalescing, or something else? I'm
not an FS developer, but my understanding is that with a 4MB extent size
hint, whenever a new write occurs, even if it's only 256KB, a new 4MB
extent would be allocated, no?

And a final question: when I print the contents of the inode with
xfs_db I get core.nextents = 972564, whereas running xfs_bmap | wc -l
on the file gives a varying number each time. Why is that?

Thanks a lot for taking the time to reply.





* Re: Failing XFS memory allocation
  2016-03-23 15:03       ` Nikolay Borisov
@ 2016-03-23 16:58         ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2016-03-23 16:58 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: xfs

On Wed, Mar 23, 2016 at 05:03:18PM +0200, Nikolay Borisov wrote:
...
> > I'm not sure where one would draw the line tbh, it's just a matter of
> > having too many extents to the point that it causes problems in terms of
> > performance (i.e., reading/modifying the extent list) or such as the
> > allocation problem you're running into. As it is, XFS maintains the full
> > extent list for an active inode in memory, so that's 800k+ extents that
> > it's looking for memory for.
> 
> I saw in the comments that this problem has already been identified and
> a possible solution would be to add another level of indirection. Also,
> can you confirm that my understanding of the operation of the
> indirection array is correct in that each entry in the indirection array
> xfs_ext_irec is responsible for 256 extents. (the er_extbuf is
> PAGE_SIZE/4kb and an extent is 16 bytes which results in 256 extents)
> 

That looks about right from the XFS_LINEAR_EXTS #define. I've seen the
comment, but I haven't yet dug into the in-core extent list data
structures deeply enough to have any intuition or insight on a potential
solution (and don't really have time to atm). Dave or others might
already have an understanding of the limitation here.
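
For reference, the relevant definitions in fs/xfs/libxfs/xfs_inode_fork.h
are roughly (paraphrased):

#define XFS_IEXT_BUFSZ          4096    /* one indirection buffer: a single 4k page */
#define XFS_LINEAR_EXTS         (XFS_IEXT_BUFSZ / (uint)sizeof(xfs_bmbt_rec_t))
                                        /* 4096 / 16 == 256 extent records per buffer */

Each xfs_ext_irec_t entry in the top-level indirection array points at
one such buffer and so covers up to 256 in-core extent records (buffers
can be partially full). Assuming the usual 16-byte xfs_ext_irec_t on
x86_64, the 107168-byte allocation in the report corresponds to roughly
6700 of those entries, and that top-level array is what kmem_realloc()
has to obtain as one physically contiguous chunk.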

> > 
> > It looks like that is your problem here. 800k or so extents over 878G
> > looks to be about 1MB per extent. Are you using extent size hints? One
> > option that might prevent this is to use a larger extent size hint
> > value. Another might be to preallocate the entire file up front with
> > fallocate. You'd probably have to experiment with what option or value
> > works best for your workload.
> 
> By preallocating with fallocate you mean using fallocate with
> FALLOC_FL_ZERO_RANGE and not FALLOC_FL_PUNCH_HOLE, right? Because as it
> stands now the file does have holes, which presumably are being filled
> and in order to be filled an extent has to be allocated which caused the
> issue?  Am I right in this reasoning?
> 

You don't need either, but definitely not hole punch. ;) See 'man 2
fallocate' for the default behavior (mode == 0). The idea is that the
allocation will occur in extents that are as large as possible, rather
than in the small, fragmented extents created as writes occur. This is
more reasonable if you ultimately expect to use the entire file.

> Currently I'm not using extents size hint but will look into that, also
> if the extent size hint is say 4mb, wouldn't that cause a fairly serious
> loss of space, provided that the writes are smaller than 4mb. Would XFS
> try to perform some sort of extent coalescing or something else? I'm not
> an FS developer but my understanding is that with a 4mb extent size,
> whenever a new write occurs even if it's 256kb a new 4mb extent would be
> allocated, no?
> 

Yes, the extent size hint will "widen" allocations due to smaller writes
to the full hint size and alignment. This results in extra space usage
at first but reduces fragmentation over time as more of the file is
used. E.g., subsequent writes within that 4m range of your previous 256k
write will already have blocks allocated (as part of a larger,
contiguous extent).

The best bet is probably to experiment with your workload or look into
your current file layout and try to choose a value that reduces
fragmentation without sacrificing too much space efficiency.
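
As a rough starting point (a sketch only; the header line and hole
entries in the xfs_bmap output make the count approximate), the current
average allocated-extent size can be estimated like this:

alloc_bytes=$(du -B1 /storage/loop/file1 | cut -f1)
extents=$(( $(xfs_bmap /storage/loop/file1 | wc -l) - $(xfs_bmap /storage/loop/file1 | grep -c hole) - 1 ))
echo "avg extent: $(( alloc_bytes / extents / 1048576 )) MiB"

With the numbers quoted above (~878G allocated over ~800k real extents)
that comes out at about 1MiB, so a hint somewhat larger than that would
be a natural place to start experimenting.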

> And a final question - when i printed the contents of the inode with
> xfs_db I get core.nextents = 972564 whereas invoking the xfs_bmap | wc
> -l on the file always gives varying numbers?
> 

I'd assume that the file is being actively modified..? I believe xfs_db
will read values from disk, which might not be coherent with the latest
in memory state, whereas bmap returns the latest layout of the file at
the time (which could also change again by the time bmap returns).
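
For what it's worth, one way to read the on-disk count directly (the
device name below is a placeholder) is something like:

sync
xfs_db -r -c "inode $(stat -c %i /storage/loop/file1)" \
          -c "print core.nextents" /dev/<backing-device>

Freezing the filesystem first (xfs_freeze -f, then -u to unfreeze)
quiesces it so the on-disk inode is fully up to date; a plain sync gets
you closer, but recently logged metadata may still not have been written
back in place, so some drift from the in-memory state is expected.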

Brian

> Thanks a lot for taking the time to reply.
> 
> 
> 
> 


* Re: Failing XFS memory allocation
  2016-03-23 13:10     ` Brian Foster
  2016-03-23 15:03       ` Nikolay Borisov
@ 2016-03-23 23:00       ` Dave Chinner
  2016-03-24  9:20         ` Nikolay Borisov
  2016-03-24  9:31         ` Christoph Hellwig
  1 sibling, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2016-03-23 23:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: Nikolay Borisov, xfs

On Wed, Mar 23, 2016 at 09:10:59AM -0400, Brian Foster wrote:
> On Wed, Mar 23, 2016 at 02:56:25PM +0200, Nikolay Borisov wrote:
> > On 03/23/2016 02:43 PM, Brian Foster wrote:
> > > On Wed, Mar 23, 2016 at 12:15:42PM +0200, Nikolay Borisov wrote:
> ...
> > > It looks like it's working to add a new extent to the in-core extent
> > > list. If this is the stack associated with the warning message (combined
> > > with the large alloc size), I wonder if there's a fragmentation issue on
> > > the file leading to an excessive number of extents.
> > 
> > Yes this is the stack trace associated.
> > 
> > > 
> > > What does 'xfs_bmap -v /storage/loop/file1' show?
> > 
> > It spews a lot of stuff but here is a summary, more detailed info can be
> > provided if you need it:
> > 
> > xfs_bmap -v /storage/loop/file1 | wc -l
> > 900908
> > xfs_bmap -v /storage/loop/file1 | grep -c hole
> > 94568
> > 
> > Also, what would constitute an "excessive number of extents"?
> > 
> 
> I'm not sure where one would draw the line tbh, it's just a matter of
> having too many extents to the point that it causes problems in terms of
> performance (i.e., reading/modifying the extent list) or such as the
> allocation problem you're running into. As it is, XFS maintains the full
> extent list for an active inode in memory, so that's 800k+ extents that
> it's looking for memory for.
> 
> It looks like that is your problem here. 800k or so extents over 878G
> looks to be about 1MB per extent.

Which I wouldn't call excessive. I use a 1MB extent size hint on all
my VM images, as this allows the underlying device to do IOs large
enough to maintain close to full bandwidth when reading and writing
regions of the underlying image file that are non-contiguous w.r.t.
sequential IO from the guest.

Mind you, it's not until I use ext4 or btrfs in the guests that I
actually see significant increases in extent counts. Rule of thumb in
my testing is that if XFS creates 100k extents in the image file,
ext4 will create 500k, and btrfs will create somewhere between 1m
and 5m extents....

i.e. XFS as a guest filesystem results in much lower image file
fragmentation than the other options....

As it is, yes, the memory allocation problem is with the in-core
extent tree, and we've known about it for some time. The issue is
that as memory gets fragmented, the top level indirection array
grows too large to be allocated as a contiguous chunk. When this
happens really depends on memory load, uptime and the way the extent
tree is being modified.

I'm working on prototype patches to convert it to an in-memory btree
but they are far from ready at this point. This isn't straightforward
because all the extent management code assumes extents are
kept in a linear array and can be directly indexed by array offset
rather than file offset. I also want to make sure we can demand page
the extent list if necessary, and that also complicates things like
locking, as we currently assume the extent list is either completely
in memory or not in memory at all.

Fundamentally, I don't want to repeat the mistakes ext4 and btrfs
have made with their fine-grained in memory extent trees that are
based on rb-trees (e.g. global locks, shrinkers that don't scale or
consume way too much CPU, excessive memory consumption, etc) and so
solving all aspects of the problem in one go is somewhat complex.
And, of course, there's so much other stuff that needs to be done at
the same time, I cannot find much time to work on it at the
moment...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Failing XFS memory allocation
  2016-03-23 23:00       ` Dave Chinner
@ 2016-03-24  9:20         ` Nikolay Borisov
  2016-03-24 21:58           ` Dave Chinner
  2016-03-24  9:31         ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Nikolay Borisov @ 2016-03-24  9:20 UTC (permalink / raw)
  To: Dave Chinner, Brian Foster; +Cc: xfs



On 03/24/2016 01:00 AM, Dave Chinner wrote:
> On Wed, Mar 23, 2016 at 09:10:59AM -0400, Brian Foster wrote:
>> On Wed, Mar 23, 2016 at 02:56:25PM +0200, Nikolay Borisov wrote:
>>> On 03/23/2016 02:43 PM, Brian Foster wrote:
>>>> On Wed, Mar 23, 2016 at 12:15:42PM +0200, Nikolay Borisov wrote:
>> ...
>>>> It looks like it's working to add a new extent to the in-core extent
>>>> list. If this is the stack associated with the warning message (combined
>>>> with the large alloc size), I wonder if there's a fragmentation issue on
>>>> the file leading to an excessive number of extents.
>>>
>>> Yes this is the stack trace associated.
>>>
>>>>
>>>> What does 'xfs_bmap -v /storage/loop/file1' show?
>>>
>>> It spews a lot of stuff but here is a summary, more detailed info can be
>>> provided if you need it:
>>>
>>> xfs_bmap -v /storage/loop/file1 | wc -l
>>> 900908
>>> xfs_bmap -v /storage/loop/file1 | grep -c hole
>>> 94568
>>>
>>> Also, what would constitute an "excessive number of extents"?
>>>
>>
>> I'm not sure where one would draw the line tbh, it's just a matter of
>> having too many extents to the point that it causes problems in terms of
>> performance (i.e., reading/modifying the extent list) or such as the
>> allocation problem you're running into. As it is, XFS maintains the full
>> extent list for an active inode in memory, so that's 800k+ extents that
>> it's looking for memory for.
>>
>> It looks like that is your problem here. 800k or so extents over 878G
>> looks to be about 1MB per extent.
> 
> Which I wouldn't call excessive. I use a 1MB extent size hint on all
> my VM images as this allows the underlying device to do IOs large
> enough to maintain clear to full bandwidth when reading and writing
> regions of the underlying image file that are non-contiguous w.r.t.
> sequential IO from the guest.
> 
> Mind you, it's not until I use ext4 or btrfs in the guests that I
> actually see significant increases in extent size. Rule of thumb in
> my testing is that if XFs creates 100k extents in the image file,
> ext4 will create 500k, and btrfs will create somewhere between 1m
> and 5m extents....
> 
> i.e. XFS as a guest filesystem gives results in much lower image
> file fragmentation that the other options....
> 
> As it is, yes, the memory allocation problem is with the in-core
> extent tree, and we've known about it for some time. The issue is
> that as memory gets fragmented, the top level indirection array
> grows too large to be allocated as a contiguous chunk. When this
> happens really depends on memory load, uptime and the way the extent
> tree is being modified.

And what about the following completely crazy idea: switching order > 3
allocations to vmalloc? I know this would incur a heavy performance hit,
but other than that, would it cause correctness issues? Of course I'm
not saying this should be implemented upstream; I'm rather asking
whether it's worth experimenting with the idea.


> 
> I'm working on prototype patches to convert it to an in-memory btree
> but they are far from ready at this point. This isn't straight
> forward because all the extent management code assumes extents are
> kept in a linear array and can be directly indexed by array offset
> rather than file offset. I also want to make sure we can demand page
> the extent list if necessary, and that also complicates things like
> locking, as we currently assume the extent list is either completely
> in memory or not in memory at all.
> 
> Fundamentally, I don't want to repeat the mistakes ext4 and btrfs
> have made with their fine-grained in memory extent trees that are
> based on rb-trees (e.g. global locks, shrinkers that don't scale or
> consume way too much CPU, excessive memory consumption, etc) and so
> solving all aspects of the problem in one go is somewhat complex.
> And, of course, there's so much other stuff that needs to be done at
> the same time, I cannot find much time to work on it at the
> moment...
> 
> Cheers,
> 
> Dave.
> 


* Re: Failing XFS memory allocation
  2016-03-23 23:00       ` Dave Chinner
  2016-03-24  9:20         ` Nikolay Borisov
@ 2016-03-24  9:31         ` Christoph Hellwig
  2016-03-24 22:00           ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-03-24  9:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Nikolay Borisov, xfs

On Thu, Mar 24, 2016 at 10:00:02AM +1100, Dave Chinner wrote:
> I'm working on prototype patches to convert it to an in-memory btree
> but they are far from ready at this point. This isn't straight
> forward because all the extent management code assumes extents are
> kept in a linear array and can be directly indexed by array offset
> rather than file offset. I also want to make sure we can demand page
> the extent list if necessary, and that also complicates things like
> locking, as we currently assume the extent list is either completely
> in memory or not in memory at all.

FYI, I did patches to get rid of almost all direct extent array access
a while ago, but I never bothered to post them as it seemed like too
much churn.  Have you started that work yet, or would it be useful
to dust those off again?


* Re: Failing XFS memory allocation
  2016-03-23 10:15 Failing XFS memory allocation Nikolay Borisov
  2016-03-23 12:43 ` Brian Foster
@ 2016-03-24  9:33 ` Christoph Hellwig
  2016-03-24  9:42   ` Nikolay Borisov
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-03-24  9:33 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: xfs

Hi Nikolay,

can you give the patch below a spin?  While it doesn't solve the root
cause it makes many typical uses of kmem_realloc behave less badly,
so it should help with at least some of the less dramatic cases of very
fragmented files:


[-- Attachment #2: 0001-xfs-improve-kmem_realloc.patch --]
[-- Type: text/plain, Size: 4600 bytes --]

From 4cfef0d21729704c79dc26621a254e507ea372a7 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 17 Mar 2016 11:15:59 +0100
Subject: xfs: improve kmem_realloc

Use krealloc to implement our realloc function.  This helps to avoid
new allocations if we are still in the slab bucket.  At least for the
bmap btree root that's actually the common case.

This also allows removing the now unused oldsize argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/kmem.c                  | 26 +++++++++++++++-----------
 fs/xfs/kmem.h                  |  2 +-
 fs/xfs/libxfs/xfs_inode_fork.c | 10 +++-------
 fs/xfs/xfs_log_recover.c       |  2 +-
 fs/xfs/xfs_mount.c             |  1 -
 5 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6f..339c696 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -93,19 +93,23 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 }
 
 void *
-kmem_realloc(const void *ptr, size_t newsize, size_t oldsize,
-	     xfs_km_flags_t flags)
+kmem_realloc(const void *old, size_t newsize, xfs_km_flags_t flags)
 {
-	void	*new;
+	int	retries = 0;
+	gfp_t	lflags = kmem_flags_convert(flags);
+	void	*ptr;
 
-	new = kmem_alloc(newsize, flags);
-	if (ptr) {
-		if (new)
-			memcpy(new, ptr,
-				((oldsize < newsize) ? oldsize : newsize));
-		kmem_free(ptr);
-	}
-	return new;
+	do {
+		ptr = krealloc(old, newsize, lflags);
+		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
+			return ptr;
+		if (!(++retries % 100))
+			xfs_err(NULL,
+	"%s(%u) possible memory allocation deadlock size %zu in %s (mode:0x%x)",
+				current->comm, current->pid,
+				newsize, __func__, lflags);
+		congestion_wait(BLK_RW_ASYNC, HZ/50);
+	} while (1);
 }
 
 void *
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e4..689f746 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -62,7 +62,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 
 extern void *kmem_alloc(size_t, xfs_km_flags_t);
 extern void *kmem_zalloc_large(size_t size, xfs_km_flags_t);
-extern void *kmem_realloc(const void *, size_t, size_t, xfs_km_flags_t);
+extern void *kmem_realloc(const void *, size_t, xfs_km_flags_t);
 static inline void  kmem_free(const void *ptr)
 {
 	kvfree(ptr);
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 4fbe226..d3d1477 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -542,7 +542,6 @@ xfs_iroot_realloc(
 		new_max = cur_max + rec_diff;
 		new_size = XFS_BMAP_BROOT_SPACE_CALC(mp, new_max);
 		ifp->if_broot = kmem_realloc(ifp->if_broot, new_size,
-				XFS_BMAP_BROOT_SPACE_CALC(mp, cur_max),
 				KM_SLEEP | KM_NOFS);
 		op = (char *)XFS_BMAP_BROOT_PTR_ADDR(mp, ifp->if_broot, 1,
 						     ifp->if_broot_bytes);
@@ -686,7 +685,6 @@ xfs_idata_realloc(
 				ifp->if_u1.if_data =
 					kmem_realloc(ifp->if_u1.if_data,
 							real_size,
-							ifp->if_real_bytes,
 							KM_SLEEP | KM_NOFS);
 			}
 		} else {
@@ -1402,8 +1400,7 @@ xfs_iext_realloc_direct(
 		if (rnew_size != ifp->if_real_bytes) {
 			ifp->if_u1.if_extents =
 				kmem_realloc(ifp->if_u1.if_extents,
-						rnew_size,
-						ifp->if_real_bytes, KM_NOFS);
+						rnew_size, KM_NOFS);
 		}
 		if (rnew_size > ifp->if_real_bytes) {
 			memset(&ifp->if_u1.if_extents[ifp->if_bytes /
@@ -1487,9 +1484,8 @@ xfs_iext_realloc_indirect(
 	if (new_size == 0) {
 		xfs_iext_destroy(ifp);
 	} else {
-		ifp->if_u1.if_ext_irec = (xfs_ext_irec_t *)
-			kmem_realloc(ifp->if_u1.if_ext_irec,
-				new_size, size, KM_NOFS);
+		ifp->if_u1.if_ext_irec =
+			kmem_realloc(ifp->if_u1.if_ext_irec, new_size, KM_NOFS);
 	}
 }
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 396565f..bf6e807 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3843,7 +3843,7 @@ xlog_recover_add_to_cont_trans(
 	old_ptr = item->ri_buf[item->ri_cnt-1].i_addr;
 	old_len = item->ri_buf[item->ri_cnt-1].i_len;
 
-	ptr = kmem_realloc(old_ptr, len+old_len, old_len, KM_SLEEP);
+	ptr = kmem_realloc(old_ptr, len + old_len, KM_SLEEP);
 	memcpy(&ptr[old_len], dp, len);
 	item->ri_buf[item->ri_cnt-1].i_len += len;
 	item->ri_buf[item->ri_cnt-1].i_addr = ptr;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 536a0ee..654799f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -89,7 +89,6 @@ xfs_uuid_mount(
 	if (hole < 0) {
 		xfs_uuid_table = kmem_realloc(xfs_uuid_table,
 			(xfs_uuid_table_size + 1) * sizeof(*xfs_uuid_table),
-			xfs_uuid_table_size  * sizeof(*xfs_uuid_table),
 			KM_SLEEP);
 		hole = xfs_uuid_table_size++;
 	}
-- 
2.1.4



* Re: Failing XFS memory allocation
  2016-03-24  9:33 ` Christoph Hellwig
@ 2016-03-24  9:42   ` Nikolay Borisov
  0 siblings, 0 replies; 13+ messages in thread
From: Nikolay Borisov @ 2016-03-24  9:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs



On 03/24/2016 11:33 AM, Christoph Hellwig wrote:
> Hi Nikolay,
> 
> can you give the patch below a spin?  While it doesn't solve the root
> cause it makes many typical uses of kmem_realloc behave less badly,
> so it should help with at least some of the less dramatic cases of very
> fragmented files:
> 

Sure. However, I just checked some other servers with an analogous
setup and there are files with even more extents (the largest count I
saw was 2 million), so I guess in this particular case memory was
fragmented and the compaction invoked from the page allocator couldn't
satisfy the allocation. So I don't know whether it will help in my
particular case, but I will give it a go in any case.

Thanks


* Re: Failing XFS memory allocation
  2016-03-24  9:20         ` Nikolay Borisov
@ 2016-03-24 21:58           ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-03-24 21:58 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Brian Foster, xfs

On Thu, Mar 24, 2016 at 11:20:23AM +0200, Nikolay Borisov wrote:
> On 03/24/2016 01:00 AM, Dave Chinner wrote:
> > As it is, yes, the memory allocation problem is with the in-core
> > extent tree, and we've known about it for some time. The issue is
> > that as memory gets fragmented, the top level indirection array
> > grows too large to be allocated as a contiguous chunk. When this
> > happens really depends on memory load, uptime and the way the extent
> > tree is being modified.
> 
> And what about the following completely crazy idea of switching order >
> 3 allocations to using vmalloc? I know this would incur heavy
> performance hit, but other than that would it cause correctness issues?
> Of course I'm not saying this should be implemented in upstream rather
> whether it's worth it having a go for experimenting with this idea.

It's not an option, as many supported platforms have extremely
limited vmalloc space.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Failing XFS memory allocation
  2016-03-24  9:31         ` Christoph Hellwig
@ 2016-03-24 22:00           ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-03-24 22:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, Nikolay Borisov, xfs

On Thu, Mar 24, 2016 at 02:31:27AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 24, 2016 at 10:00:02AM +1100, Dave Chinner wrote:
> > I'm working on prototype patches to convert it to an in-memory btree
> > but they are far from ready at this point. This isn't straight
> > forward because all the extent management code assumes extents are
> > kept in a linear array and can be directly indexed by array offset
> > rather than file offset. I also want to make sure we can demand page
> > the extent list if necessary, and that also complicates things like
> > locking, as we currently assume the extent list is either completely
> > in memory or not in memory at all.
> 
> FYI, I did patches to get rid almost all direct extent array access
> a while ago, but I never bothered to post it as it seemed to much
> churn.  Have you started that work yet or would it be useful
> to dust those up again?

I've done bits of it, but haven't completed it. Send me the patches
and I'll see which approach makes the most sense...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
