* 3.10-rc4 stalls during mmap writes
@ 2013-06-07 19:37 ` Shawn Bohrer
  0 siblings, 0 replies; 12+ messages in thread
From: Shawn Bohrer @ 2013-06-07 19:37 UTC (permalink / raw)
  To: xfs; +Cc: linux-kernel

I've started testing the 3.10 kernel (previously I was on 3.4), and I'm
encountering some fairly large stalls in my memory-mapped writes, in the
range of 0.01 to 1 seconds.  I've managed to capture two of these stalls
so far and both looked like the following:

1) Writing process writes to a new page and blocks on xfs_ilock:

    <...>-21567 [009]  9435.453069: sched_switch:     prev_comm=tick_receiver_m prev_pid=21567 prev_prio=79 prev_state=D ==> next_comm=swapper/9 next_pid=0 next_prio=120
    <...>-21567 [009]  9435.453072: kernel_stack:     <stack trace>
=> schedule (ffffffff814ca379)
=> rwsem_down_write_failed (ffffffff814cb095)
=> call_rwsem_down_write_failed (ffffffff81275053)
=> xfs_ilock (ffffffff8120b25c)
=> xfs_vn_update_time (ffffffff811cf3d3)
=> update_time (ffffffff81158dd3)
=> file_update_time (ffffffff81158f0c)
=> block_page_mkwrite (ffffffff81171d23)
=> xfs_vm_page_mkwrite (ffffffff811c5375)
=> do_wp_page (ffffffff8110c27f)
=> handle_pte_fault (ffffffff8110dd24)
=> handle_mm_fault (ffffffff8110f430)
=> __do_page_fault (ffffffff814cef72)
=> do_page_fault (ffffffff814cf2e7)
=> page_fault (ffffffff814cbab2)

2) kworker calls xfs_iunlock and wakes up my process:

   kworker/u50:1-403   [013]  9436.027354: sched_wakeup:     comm=tick_receiver_m pid=21567 prio=79 success=1 target_cpu=009
   kworker/u50:1-403   [013]  9436.027359: kernel_stack:     <stack trace>
=> ttwu_do_activate.constprop.34 (ffffffff8106c556)
=> try_to_wake_up (ffffffff8106e996)
=> wake_up_process (ffffffff8106ea87)
=> __rwsem_do_wake (ffffffff8126e531)
=> rwsem_wake (ffffffff8126e62a)
=> call_rwsem_wake (ffffffff81275077)
=> xfs_iunlock (ffffffff8120b55c)
=> xfs_iomap_write_allocate (ffffffff811ce4e7)
=> xfs_map_blocks (ffffffff811bf145)
=> xfs_vm_writepage (ffffffff811bfbc2)
=> __writepage (ffffffff810f14e7)
=> write_cache_pages (ffffffff810f189e)
=> generic_writepages (ffffffff810f1b3a)
=> xfs_vm_writepages (ffffffff811bef8d)
=> do_writepages (ffffffff810f3380)
=> __writeback_single_inode (ffffffff81166ae5)
=> writeback_sb_inodes (ffffffff81167d4d)
=> __writeback_inodes_wb (ffffffff8116800e)
=> wb_writeback (ffffffff811682bb)
=> wb_check_old_data_flush (ffffffff811683ff)
=> wb_do_writeback (ffffffff81169bd1)
=> bdi_writeback_workfn (ffffffff81169cca)
=> process_one_work (ffffffff81059104)
=> worker_thread (ffffffff8105a1bc)
=> kthread (ffffffff810605f0)
=> ret_from_fork (ffffffff814d395c)

In this case my process stalled for roughly half a second:

<...>-21567 [009]  9436.027388: print:                tracing_mark_write: stall of 0.574282
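
In case it helps anyone reproduce this, an ftrace setup along the
following lines produces the kind of output shown above (a rough
sketch, the exact options may differ from what I actually ran; the
"stall of ..." line is written to trace_marker by our test program
whenever a write takes longer than its threshold):

    cd /sys/kernel/debug/tracing
    # scheduler events shown in the traces above
    echo 1 > events/sched/sched_switch/enable
    echo 1 > events/sched/sched_wakeup/enable
    # record a kernel stack with every event (the kernel_stack lines)
    echo 1 > options/stacktrace
    echo 1 > tracing_on
    # ... reproduce the stall, then save the buffer ...
    cat trace > stall-trace.txt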

So I guess my question is does anyone know why I'm now seeing these
stalls with 3.10?  Are there any suggestions for how to eliminate them?

# xfs_info /home/
meta-data=/dev/sda5              isize=256    agcount=4, agsize=67774016 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=271096064, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=132371, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# grep sda5 /proc/mounts 
/dev/sda5 /home xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,noquota 0 0

Thanks,
Shawn

* Re: 3.10-rc4 stalls during mmap writes
  2013-06-07 19:37 ` Shawn Bohrer
@ 2013-06-09  3:37   ` Dave Chinner
  -1 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2013-06-09  3:37 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: xfs, linux-kernel

On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> I've started testing the 3.10 kernel (previously I was on 3.4), and I'm
> encountering some fairly large stalls in my memory-mapped writes, in the
> range of 0.01 to 1 seconds.  I've managed to capture two of these stalls
> so far and both looked like the following:
> 
> 1) Writing process writes to a new page and blocks on xfs_ilock:
> 
>     <...>-21567 [009]  9435.453069: sched_switch:     prev_comm=tick_receiver_m prev_pid=21567 prev_prio=79 prev_state=D ==> next_comm=swapper/9 next_pid=0 next_prio=120
>     <...>-21567 [009]  9435.453072: kernel_stack:     <stack trace>
> => schedule (ffffffff814ca379)
> => rwsem_down_write_failed (ffffffff814cb095)
> => call_rwsem_down_write_failed (ffffffff81275053)
> => xfs_ilock (ffffffff8120b25c)
> => xfs_vn_update_time (ffffffff811cf3d3)
> => update_time (ffffffff81158dd3)
> => file_update_time (ffffffff81158f0c)
> => block_page_mkwrite (ffffffff81171d23)
> => xfs_vm_page_mkwrite (ffffffff811c5375)
> => do_wp_page (ffffffff8110c27f)
> => handle_pte_fault (ffffffff8110dd24)
> => handle_mm_fault (ffffffff8110f430)
> => __do_page_fault (ffffffff814cef72)
> => do_page_fault (ffffffff814cf2e7)
> => page_fault (ffffffff814cbab2)

Changing C/MTIME on the inode. Needs a lock, the update is
transactional.

> 
> 2) kworker calls xfs_iunlock and wakes up my process:
> 
>    kworker/u50:1-403   [013]  9436.027354: sched_wakeup:     comm=tick_receiver_m pid=21567 prio=79 success=1 target_cpu=009
>    kworker/u50:1-403   [013]  9436.027359: kernel_stack:     <stack trace>
> => ttwu_do_activate.constprop.34 (ffffffff8106c556)
> => try_to_wake_up (ffffffff8106e996)
> => wake_up_process (ffffffff8106ea87)
> => __rwsem_do_wake (ffffffff8126e531)
> => rwsem_wake (ffffffff8126e62a)
> => call_rwsem_wake (ffffffff81275077)
> => xfs_iunlock (ffffffff8120b55c)
> => xfs_iomap_write_allocate (ffffffff811ce4e7)
> => xfs_map_blocks (ffffffff811bf145)
> => xfs_vm_writepage (ffffffff811bfbc2)

And allocation during writeback is holding the lock on that inode
as it's already in a transaction.

> So I guess my question is does anyone know why I'm now seeing these
> stalls with 3.10?

Because we made all metadata updates in XFS fully transactional in
3.4:

commit 8a9c9980f24f6d86e0ec0150ed35fba45d0c9f88
Author: Christoph Hellwig <hch@infradead.org>
Date:   Wed Feb 29 09:53:52 2012 +0000

    xfs: log timestamp updates
    
    Timestamps on regular files are the last metadata that XFS does not update
    transactionally.  Now that we use the delaylog mode exclusively and made
    the log scode scale extremly well there is no need to bypass that code for
    timestamp updates.  Logging all updates allows to drop a lot of code, and
    will allow for further performance improvements later on.
    
    Note that this patch drops optimized handling of fdatasync - it will be
    added back in a separate commit.
    
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mark Tinguely <tinguely@sgi.com>
    Signed-off-by: Ben Myers <bpm@sgi.com>
$ git describe --contains 8a9c998
v3.4-rc1~55^2~23
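
Or, with a reasonably recent git, you can ask directly whether a given
tag contains it (the tag here is just an example):

$ git merge-base --is-ancestor 8a9c9980f24f v3.10-rc4 && echo yes
yes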

IOWs, you're just lucky you haven't noticed it on 3.4....

> Are there any suggestions for how to eliminate them?

Nope. You're stuck with it - there are far more places in the page
fault path where you can get stuck on the same lock for the same
reason - e.g. during block mapping for the newly added pagecache
page...

Hint: mmap() does not provide -deterministic- low latency access to
mapped pages - it is only "mostly low latency". mmap() has exactly
the same worst case page fault latencies as the equivalent write()
syscall. e.g., dirty too many pages and mmap() write page faults
can be throttled, just like a write() syscall....
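
If you want to experiment, the knobs that control when dirty pages get
written back (and when writers get throttled) are the usual vm.dirty_*
sysctls, e.g. to see what you're currently running with:

$ sysctl vm.dirty_ratio vm.dirty_background_ratio \
         vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

No promises that tuning them gets rid of the stalls, though - it only
changes when you pay for them.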

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: 3.10-rc4 stalls during mmap writes
  2013-06-09  3:37   ` Dave Chinner
@ 2013-06-10 18:45     ` Shawn Bohrer
  -1 siblings, 0 replies; 12+ messages in thread
From: Shawn Bohrer @ 2013-06-10 18:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-kernel

On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> > So I guess my question is does anyone know why I'm now seeing these
> > stalls with 3.10?
> 
> Because we made all metadata updates in XFS fully transactional in
> 3.4:
> 
> commit 8a9c9980f24f6d86e0ec0150ed35fba45d0c9f88
> Author: Christoph Hellwig <hch@infradead.org>
> Date:   Wed Feb 29 09:53:52 2012 +0000
> 
>     xfs: log timestamp updates
>     
>     Timestamps on regular files are the last metadata that XFS does not update
>     transactionally.  Now that we use the delaylog mode exclusively and made
>     the log scode scale extremly well there is no need to bypass that code for
>     timestamp updates.  Logging all updates allows to drop a lot of code, and
>     will allow for further performance improvements later on.
>     
>     Note that this patch drops optimized handling of fdatasync - it will be
>     added back in a separate commit.
>     
>     Reviewed-by: Dave Chinner <dchinner@redhat.com>
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Reviewed-by: Mark Tinguely <tinguely@sgi.com>
>     Signed-off-by: Ben Myers <bpm@sgi.com>
> $ git describe --contains 8a9c998
> v3.4-rc1~55^2~23
> 
> IOWs, you're just lucky you haven't noticed it on 3.4....
> 
> > Are there any suggestions for how to eliminate them?
> 
> Nope. You're stuck with it - there's far more places in the page
> fault path where you can get stuck on the same lock for the same
> reason - e.g. during block mapping for the newly added pagecache
> page...
> 
> Hint: mmap() does not provide -deterministic- low latency access to
> mapped pages - it is only "mostly low latency".

Hi Dave, I appreciate your time and analysis.  I am sadly aware that
doing file I/O and expecting low latency is a bit of a stretch.  This
is also not the first time I've battled these types of stalls and
realize that the best I can do is search for opportunities to reduce
the probability of a stall, or find ways to reduce the duration of a
stall.

In this case since updating timestamps has been transactional and in
place since 3.4, it is obvious to me that this is not the cause of
both the increased rate and duration of the stalls on 3.10.  I assure
you on 3.4 we have very few stalls that are greater than 10
milliseconds in our normal workload and with 3.10 I'm seeing them
regularly.  Since we know I can, and likely do, hit the same code path
on 3.4, that tells me that the xfs_ilock() is likely being held
for a longer duration on the new kernel.  Let's see if we can
determine why the lock is held for so long.

Here is an attempt at that, as events unfold in time order.  I'm in no
way a filesystem developer so any input or analysis of what we're
waiting on is appreciated.  There are also multiple kworker threads
involved which certainly complicates things.
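
For context, funcgraph_entry/funcgraph_exit lines like the ones below
come from the function_graph tracer; a setup along these lines produces
them (a rough sketch, the exact filters may differ from what I actually
ran):

    cd /sys/kernel/debug/tracing
    # graph everything called beneath the allocation path of interest
    echo xfs_iomap_write_allocate > set_graph_function
    echo function_graph > current_tracer
    echo 1 > tracing_on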


It starts with kworker/u49:0 which acquires xfs_ilock() inside
iomap_write_allocate().


   kworker/u49:0-15748 [004] 256032.180361: funcgraph_entry:                   |  xfs_iomap_write_allocate() {
   kworker/u49:0-15748 [004] 256032.180363: funcgraph_entry:        0.074 us   |    xfs_ilock();


In the next two chunks it appears that kworker/u49:0 calls
xfs_bmapi_allocate which offloads that work to kworker/4:0 calling
__xfs_bmapi_allocate().  kworker/4:0 ends up blocking on xfs_buf_lock().


     kworker/4:0-27520 [004] 256032.180389: sched_switch:         prev_comm=kworker/4:0 prev_pid=27520 prev_prio=120 prev_state=D ==> next_comm=kworker/u49:0 next_pid=15748 next_prio=120
     kworker/4:0-27520 [004] 256032.180393: kernel_stack:         <stack trace>
=> schedule (ffffffff814ca379)
=> schedule_timeout (ffffffff814c810d)
=> __down_common (ffffffff814c8e5e)
=> __down (ffffffff814c8f26)
=> down (ffffffff810658e1)
=> xfs_buf_lock (ffffffff811c12a4)
=> _xfs_buf_find (ffffffff811c1469)
=> xfs_buf_get_map (ffffffff811c16e4)
=> xfs_buf_read_map (ffffffff811c2691)
=> xfs_trans_read_buf_map (ffffffff81225fa9)
=> xfs_btree_read_buf_block.constprop.6 (ffffffff811f2242)
=> xfs_btree_lookup_get_block (ffffffff811f22fb)
=> xfs_btree_lookup (ffffffff811f6707)
=> xfs_alloc_ag_vextent_near (ffffffff811d9d52)
=> xfs_alloc_ag_vextent (ffffffff811da8b5)
=> xfs_alloc_vextent (ffffffff811db545)
=> xfs_bmap_btalloc (ffffffff811e6951)
=> xfs_bmap_alloc (ffffffff811e6dee)
=> __xfs_bmapi_allocate (ffffffff811ec024)
=> xfs_bmapi_allocate_worker (ffffffff811ec283)
=> process_one_work (ffffffff81059104)
=> worker_thread (ffffffff8105a1bc)
=> kthread (ffffffff810605f0)
=> ret_from_fork (ffffffff814d395c)

kworker/u49:0-15748 [004] 256032.180403: sched_switch:         prev_comm=kworker/u49:0 prev_pid=15748 prev_prio=120 prev_state=D ==> next_comm=kworker/4:1H next_pid=3921 next_prio=100
   kworker/u49:0-15748 [004] 256032.180408: kernel_stack:         <stack trace>
=> schedule (ffffffff814ca379)
=> schedule_timeout (ffffffff814c810d)
=> wait_for_completion (ffffffff814ca2e5)
=> xfs_bmapi_allocate (ffffffff811eea57)
=> xfs_bmapi_write (ffffffff811eef33)
=> xfs_iomap_write_allocate (ffffffff811ce49e)
=> return_to_handler (ffffffff814d3845)
=> xfs_map_blocks (ffffffff811bf145)
=> xfs_vm_writepage (ffffffff811bfbc2)
=> __writepage (ffffffff810f14e7)
=> write_cache_pages (ffffffff810f189e)
=> generic_writepages (ffffffff810f1b3a)
=> xfs_vm_writepages (ffffffff811bef8d)
=> do_writepages (ffffffff810f3380)
=> __writeback_single_inode (ffffffff81166ae5)
=> writeback_sb_inodes (ffffffff81167d4d)
=> __writeback_inodes_wb (ffffffff8116800e)
=> wb_writeback (ffffffff811682bb)
=> wb_check_old_data_flush (ffffffff811683ff)
=> wb_do_writeback (ffffffff81169bd1)
=> bdi_writeback_workfn (ffffffff81169cca)
=> process_one_work (ffffffff81059104)
=> worker_thread (ffffffff8105a1bc)
=> kthread (ffffffff810605f0)
=> ret_from_fork (ffffffff814d395c)


Here I have kworker/2:1H which appears to get its work scheduled from
a softirq.

          <idle>-0     [002] 256032.890594: sched_wakeup:         comm=kworker/2:1H pid=2176 prio=100 success=1 target_cpu=002
          <idle>-0     [002] 256032.890598: kernel_stack:         <stack trace>
=> ttwu_do_activate.constprop.34 (ffffffff8106c556)
=> try_to_wake_up (ffffffff8106e996)
=> wake_up_process (ffffffff8106ea87)
=> wake_up_worker (ffffffff810568c4)
=> insert_work (ffffffff8105821d)
=> __queue_work (ffffffff81058391)
=> queue_work_on (ffffffff81058605)
=> xfs_buf_ioend (ffffffff811c1a37)
=> _xfs_buf_ioend.constprop.1 (ffffffff811c1aaa)
=> xfs_buf_bio_end_io (ffffffff811c1f58)
=> bio_endio (ffffffff8117362d)
=> blk_update_request (ffffffff8124f1b8)
=> blk_update_bidi_request (ffffffff8124f511)
=> blk_end_bidi_request (ffffffff8124f59c)
=> blk_end_request (ffffffff8124f630)
=> scsi_io_completion (ffffffff8136156f)
=> scsi_finish_command (ffffffff81357e11)
=> scsi_softirq_done (ffffffff813613bf)
=> blk_done_softirq (ffffffff81255fe3)
=> __do_softirq (ffffffff81044230)
=> irq_exit (ffffffff81044505)
=> smp_call_function_single_interrupt (ffffffff81024475)
=> call_function_single_interrupt (ffffffff814d477a)
=> arch_cpu_idle (ffffffff8100ad8e)
=> cpu_startup_entry (ffffffff8107b39e)
=> start_secondary (ffffffff81b259f3)
          <idle>-0     [002] 256032.890603: sched_switch:         prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/2:1H next_pid=2176 next_prio=100


kworker/2:1H calls xfs_buf_unlock() which then wakes up kworker/4:0


    kworker/2:1H-2176  [002] 256032.890607: sched_wakeup:         comm=kworker/4:0 pid=27520 prio=120 success=1 target_cpu=004
    kworker/2:1H-2176  [002] 256032.890609: kernel_stack:         <stack trace>
=> ttwu_do_activate.constprop.34 (ffffffff8106c556)
=> try_to_wake_up (ffffffff8106e996)
=> wake_up_process (ffffffff8106ea87)
=> __up (ffffffff814c8dce)
=> up (ffffffff81065762)
=> xfs_buf_unlock (ffffffff811c163d)
=> xfs_buf_iodone_work (ffffffff811c1908)
=> xfs_buf_ioend (ffffffff811c198b)
=> xfs_buf_iodone_callbacks (ffffffff812222b6)
=> xfs_buf_iodone_work (ffffffff811c18be)
=> process_one_work (ffffffff81059104)
=> worker_thread (ffffffff8105a1bc)
=> kthread (ffffffff810605f0)
=> ret_from_fork (ffffffff814d395c)

kworker/4:0 wakes up kworker/u49:0, which then calls xfs_iunlock(),
ending the delay.

     kworker/4:0-27520 [004] 256032.890646: sched_wakeup:         comm=kworker/u49:0 pid=15748 prio=120 success=1 target_cpu=006
   kworker/u49:0-15748 [006] 256032.890665: funcgraph_entry:        1.143 us   |    xfs_iunlock();
   kworker/u49:0-15748 [006] 256032.890666: funcgraph_exit:       ! 710304.855 us |  }


So to summarize it appears that most of the time was spent with
kworker/4:0 blocking in xfs_buf_lock() and kworker/2:1H, which is woken
up by a softirq, is the one that calls xfs_buf_unlock().  Assuming I'm
not missing some important intermediate steps does this provide any
more information about what resource I'm actually waiting for?  Does
this point to any changes that happened after 3.4?  Are there any tips
that could help minimize these contentions?

Thanks,
Shawn

* Re: 3.10-rc4 stalls during mmap writes
  2013-06-10 18:45     ` Shawn Bohrer
@ 2013-06-10 20:53       ` Dave Chinner
  -1 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2013-06-10 20:53 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: xfs, linux-kernel

On Mon, Jun 10, 2013 at 01:45:59PM -0500, Shawn Bohrer wrote:
> On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> > > So I guess my question is does anyone know why I'm now seeing these
> > > stalls with 3.10?
> > 
> > Because we made all metadata updates in XFS fully transactional in
> > 3.4:
> > 
> > commit 8a9c9980f24f6d86e0ec0150ed35fba45d0c9f88
> > Author: Christoph Hellwig <hch@infradead.org>
> > Date:   Wed Feb 29 09:53:52 2012 +0000
> > 
> >     xfs: log timestamp updates
> >     
> >     Timestamps on regular files are the last metadata that XFS does not update
> >     transactionally.  Now that we use the delaylog mode exclusively and made
> >     the log scode scale extremly well there is no need to bypass that code for
> >     timestamp updates.  Logging all updates allows to drop a lot of code, and
> >     will allow for further performance improvements later on.
> >     
> >     Note that this patch drops optimized handling of fdatasync - it will be
> >     added back in a separate commit.
> >     
> >     Reviewed-by: Dave Chinner <dchinner@redhat.com>
> >     Signed-off-by: Christoph Hellwig <hch@lst.de>
> >     Reviewed-by: Mark Tinguely <tinguely@sgi.com>
> >     Signed-off-by: Ben Myers <bpm@sgi.com>
> > $ git describe --contains 8a9c998
> > v3.4-rc1~55^2~23
> > 
> > IOWs, you're just lucky you haven't noticed it on 3.4....
> > 
> > > Are there any suggestions for how to eliminate them?
> > 
> > Nope. You're stuck with it - there's far more places in the page
> > fault path where you can get stuck on the same lock for the same
> > reason - e.g. during block mapping for the newly added pagecache
> > page...
> > 
> > Hint: mmap() does not provide -deterministic- low latency access to
> > mapped pages - it is only "mostly low latency".
> 
> Hi Dave, I appreciate your time and analysis.  I am sadly aware that
> doing file I/O and expecting low latency is a bit of a stretch.  This
> is also not the first time I've battled these types of stalls and
> realize that the best I can do is search for opportunities to reduce
> the probability of a stall, or find ways to reduce the duration of a
> stall.
> 
> In this case since updating timestamps has been transactional and in
> place since 3.4, it is obvious to me that this is not the cause of
> both the increased rate and duration of the stalls on 3.10.  I assure
> you on 3.4 we have very few stalls that are greater than 10
> milliseconds in our normal workload and with 3.10 I'm seeing them
> regularly.  Since we know I can, and likely do, hit the same code path
> on 3.4, that tells me that the xfs_ilock() is likely being held
> for a longer duration on the new kernel.  Let's see if we can
> determine why the lock is held for so long.
> 
> Here is an attempt at that, as events unfold in time order.  I'm in no
> way a filesystem developer so any input or analysis of what we're
> waiting on is appreciated.  There are also multiple kworker threads
> involved which certainly complicates things.
> 
> 
> It starts with kworker/u49:0 which acquires xfs_ilock() inside
> iomap_write_allocate().
> 
> 
>    kworker/u49:0-15748 [004] 256032.180361: funcgraph_entry:                   |  xfs_iomap_write_allocate() {
>    kworker/u49:0-15748 [004] 256032.180363: funcgraph_entry:        0.074 us   |    xfs_ilock();
> 
> 
> In the next two chunks it appears that kworker/u49:0 calls
> xfs_bmapi_allocate which offloads that work to kworker/4:0 calling
> __xfs_bmapi_allocate().  kworker/4:0 ends up blocking on xfs_buf_lock().
> 
> 
>      kworker/4:0-27520 [004] 256032.180389: sched_switch:         prev_comm=kworker/4:0 prev_pid=27520 prev_prio=120 prev_state=D ==> next_comm=kworker/u49:0 next_pid=15748 next_prio=120
>      kworker/4:0-27520 [004] 256032.180393: kernel_stack:         <stack trace>
> => schedule (ffffffff814ca379)
> => schedule_timeout (ffffffff814c810d)
> => __down_common (ffffffff814c8e5e)
> => __down (ffffffff814c8f26)
> => down (ffffffff810658e1)
> => xfs_buf_lock (ffffffff811c12a4)
> => _xfs_buf_find (ffffffff811c1469)
> => xfs_buf_get_map (ffffffff811c16e4)
> => xfs_buf_read_map (ffffffff811c2691)
> => xfs_trans_read_buf_map (ffffffff81225fa9)
> => xfs_btree_read_buf_block.constprop.6 (ffffffff811f2242)
> => xfs_btree_lookup_get_block (ffffffff811f22fb)
> => xfs_btree_lookup (ffffffff811f6707)
> => xfs_alloc_ag_vextent_near (ffffffff811d9d52)
> => xfs_alloc_ag_vextent (ffffffff811da8b5)
> => xfs_alloc_vextent (ffffffff811db545)
> => xfs_bmap_btalloc (ffffffff811e6951)
> => xfs_bmap_alloc (ffffffff811e6dee)
> => __xfs_bmapi_allocate (ffffffff811ec024)
> => xfs_bmapi_allocate_worker (ffffffff811ec283)
> => process_one_work (ffffffff81059104)
> => worker_thread (ffffffff8105a1bc)
> => kthread (ffffffff810605f0)
> => ret_from_fork (ffffffff814d395c)

It has blocked waiting for freespace btree IO to complete....

> kworker/u49:0-15748 [004] 256032.180403: sched_switch:         prev_comm=kworker/u49:0 prev_pid=15748 prev_prio=120 prev_state=D ==> next_comm=kworker/4:1H next_pid=3921 next_prio=100
>    kworker/u49:0-15748 [004] 256032.180408: kernel_stack:         <stack trace>
> => schedule (ffffffff814ca379)
> => schedule_timeout (ffffffff814c810d)
> => wait_for_completion (ffffffff814ca2e5)
> => xfs_bmapi_allocate (ffffffff811eea57)
> => xfs_bmapi_write (ffffffff811eef33)
> => xfs_iomap_write_allocate (ffffffff811ce49e)
> => return_to_handler (ffffffff814d3845)
> => xfs_map_blocks (ffffffff811bf145)
> => xfs_vm_writepage (ffffffff811bfbc2)
> => __writepage (ffffffff810f14e7)
> => write_cache_pages (ffffffff810f189e)
> => generic_writepages (ffffffff810f1b3a)
> => xfs_vm_writepages (ffffffff811bef8d)
> => do_writepages (ffffffff810f3380)
> => __writeback_single_inode (ffffffff81166ae5)
> => writeback_sb_inodes (ffffffff81167d4d)
> => __writeback_inodes_wb (ffffffff8116800e)
> => wb_writeback (ffffffff811682bb)
> => wb_check_old_data_flush (ffffffff811683ff)
> => wb_do_writeback (ffffffff81169bd1)
> => bdi_writeback_workfn (ffffffff81169cca)
> => process_one_work (ffffffff81059104)
> => worker_thread (ffffffff8105a1bc)
> => kthread (ffffffff810605f0)
> => ret_from_fork (ffffffff814d395c)

which is during allocation triggered by writeback IO.

> Here I have kworker/2:1H which appears to get its work scheduled from
> a softirq.
> 
>           <idle>-0     [002] 256032.890594: sched_wakeup:         comm=kworker/2:1H pid=2176 prio=100 success=1 target_cpu=002
>           <idle>-0     [002] 256032.890598: kernel_stack:         <stack trace>
> => ttwu_do_activate.constprop.34 (ffffffff8106c556)
> => try_to_wake_up (ffffffff8106e996)
> => wake_up_process (ffffffff8106ea87)
> => wake_up_worker (ffffffff810568c4)
> => insert_work (ffffffff8105821d)
> => __queue_work (ffffffff81058391)
> => queue_work_on (ffffffff81058605)
> => xfs_buf_ioend (ffffffff811c1a37)
> => _xfs_buf_ioend.constprop.1 (ffffffff811c1aaa)
> => xfs_buf_bio_end_io (ffffffff811c1f58)
> => bio_endio (ffffffff8117362d)
> => blk_update_request (ffffffff8124f1b8)
> => blk_update_bidi_request (ffffffff8124f511)
> => blk_end_bidi_request (ffffffff8124f59c)
> => blk_end_request (ffffffff8124f630)
> => scsi_io_completion (ffffffff8136156f)
> => scsi_finish_command (ffffffff81357e11)
> => scsi_softirq_done (ffffffff813613bf)
> => blk_done_softirq (ffffffff81255fe3)
> => __do_softirq (ffffffff81044230)
> => irq_exit (ffffffff81044505)
> => smp_call_function_single_interrupt (ffffffff81024475)
> => call_function_single_interrupt (ffffffff814d477a)
> => arch_cpu_idle (ffffffff8100ad8e)
> => cpu_startup_entry (ffffffff8107b39e)
> => start_secondary (ffffffff81b259f3)
>           <idle>-0     [002] 256032.890603: sched_switch:         prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/2:1H next_pid=2176 next_prio=100
> 

That's physical IO completion occurring, up to the point where XFS
queues it to a different workqueue...

> 
> kworker/2:1H calls xfs_buf_unlock() which then wakes up kworker/4:0
> 
> 
>     kworker/2:1H-2176  [002] 256032.890607: sched_wakeup:         comm=kworker/4:0 pid=27520 prio=120 success=1 target_cpu=004
>     kworker/2:1H-2176  [002] 256032.890609: kernel_stack:         <stack trace>
> => ttwu_do_activate.constprop.34 (ffffffff8106c556)
> => try_to_wake_up (ffffffff8106e996)
> => wake_up_process (ffffffff8106ea87)
> => __up (ffffffff814c8dce)
> => up (ffffffff81065762)
> => xfs_buf_unlock (ffffffff811c163d)
> => xfs_buf_iodone_work (ffffffff811c1908)
> => xfs_buf_ioend (ffffffff811c198b)
> => xfs_buf_iodone_callbacks (ffffffff812222b6)
> => xfs_buf_iodone_work (ffffffff811c18be)
> => process_one_work (ffffffff81059104)
> => worker_thread (ffffffff8105a1bc)
> => kthread (ffffffff810605f0)
> => ret_from_fork (ffffffff814d395c)
> 
> kworker/4:0 wakes up kworker/u49:0, which then calls xfs_iunlock(),
> ending the delay.

And that's the IO completed.

> 
>      kworker/4:0-27520 [004] 256032.890646: sched_wakeup:         comm=kworker/u49:0 pid=15748 prio=120 success=1 target_cpu=006
>    kworker/u49:0-15748 [006] 256032.890665: funcgraph_entry:        1.143 us   |    xfs_iunlock();
>    kworker/u49:0-15748 [006] 256032.890666: funcgraph_exit:       ! 710304.855 us |  }
> 
> 
> So to summarize it appears that most of the time was spent with
> kworker/4:0 blocking in xfs_buf_lock() and kworker/2:1H, which is woken
> up by a softirq, is the one that calls xfs_buf_unlock().  Assuming I'm
> not missing some important intermediate steps does this provide any
> more information about what resource I'm actually waiting for?  Does
> this point to any changes that happened after 3.4?  Are there any tips
> that could help minimize these contentions?

The only difference between this and 3.4 is the allocation workqueue
thread. That, however, won't be introducing second-long delays. What
you are seeing here is simply the latency of waiting for
background metadata IO to complete during an allocation which has
the ilock held....
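
If you want to see how long that metadata IO is actually taking, the
simplest thing is to watch the device while the stalls are happening,
e.g.:

$ iostat -x 1

and look at the await column for the disk in question - if that's
sitting in the hundreds of milliseconds, there's your latency.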

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: 3.10-rc4 stalls during mmap writes
  2013-06-10 20:53       ` Dave Chinner
@ 2013-06-11 16:17         ` Shawn Bohrer
  -1 siblings, 0 replies; 12+ messages in thread
From: Shawn Bohrer @ 2013-06-11 16:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-kernel

On Tue, Jun 11, 2013 at 06:53:15AM +1000, Dave Chinner wrote:
> On Mon, Jun 10, 2013 at 01:45:59PM -0500, Shawn Bohrer wrote:
> > On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> > > On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> > So to summarize, it appears that most of the time was spent with
> > kworker/4:0 blocking in xfs_buf_lock(), and that kworker/2:1H, which
> > is woken up by a softirq, is the one that calls xfs_buf_unlock().
> > Assuming I'm not missing some important intermediate steps, does this
> > provide any more information about what resource I'm actually waiting
> > for?  Does this point to any changes that happened after 3.4?  Are
> > there any tips that could help minimize these contentions?
> 
> The only difference between this and 3.4 is the allocation workqueue
> thread. That, however, won't be introducing second-long delays. What
> you are seeing here is simply the latency of waiting for background
> metadata IO to complete during an allocation that holds the
> ilock....

Again, thank you for your analysis, Dave.  I've taken a step back to
look at the big picture and that allowed me to identify what _has_
changed between 3.4 and 3.10.  What changed is the behavior of
vm.dirty_expire_centisecs.  Honestly, the previous behavior never made
any sense to me and I'm not entirely sure the current behavior does
either.

In the workload I've been debugging we append data to many small files
using mmap.  The writes are small and the total data rate is very low,
so for most files it may take several minutes to fill a page.
Low-latency writes are important, but as you know stalls are always
possible.  One way to reduce the probability of a stall is to reduce
the frequency of writeback, and adjusting vm.dirty_expire_centisecs
and/or vm.dirty_writeback_centisecs should allow us to do that.
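
(For anyone following along, these are the knobs in question; a
minimal sketch of how they are read and adjusted at runtime, with
30000 being the value discussed below and 500 the stock default:

    sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs  # inspect current values
    sysctl -w vm.dirty_expire_centisecs=30000   # treat dirty pages as expired after ~5 min
    sysctl -w vm.dirty_writeback_centisecs=500  # flusher wakeup interval; 500 (5 s) is the default

They are generic VM writeback tunables, not anything XFS-specific.)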

On kernels 3.4 and older we chose to increase
vm.dirty_expire_centisecs to 30000 since we can comfortably lose 5
minutes of data in the event of a system failure, and we believed this
would cause a fairly consistent low data rate as every
vm.dirty_writeback_centisecs (5s) it would write back all dirty pages
that were vm.dirty_expire_centisecs (5min) old.  On old kernels that
isn't exactly what happened.  Instead, every 5 minutes there would be
a burst of writeback and a slow trickle at all other times.  This also
reduced the total amount of data written back, since the same dirty
page was no longer written back every 30 seconds, and it virtually
eliminated the stalls we saw, so the setting was left alone.

On 3.10, vm.dirty_expire_centisecs=30000 no longer does the same
thing.  Honestly I'm not sure what it does, but the result is a fairly
consistent high data rate being written back to disk.  The fact that
it is consistent might lead me to believe that it writes back all
pages that are vm.dirty_expire_centisecs old every
vm.dirty_writeback_centisecs, but the data rate is far too high for
that to be true.  It appears that I can effectively get the same old
behavior by setting vm.dirty_writeback_centisecs=30000.
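
(In case it helps anyone else hitting this, the workaround amounts to
roughly the following; a sketch, assuming the usual sysctl setup:

    sysctl -w vm.dirty_writeback_centisecs=30000   # wake the flusher every 5 min instead of every 5 s
    echo 'vm.dirty_writeback_centisecs = 30000' >> /etc/sysctl.conf   # persist across reboots

We can still lose up to roughly 5 minutes of data on a crash, but as
noted above that is acceptable for this workload.)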

--
Shawn

* Re: 3.10-rc4 stalls during mmap writes
  2013-06-11 16:17         ` Shawn Bohrer
@ 2013-06-11 22:40           ` Dave Chinner
  -1 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2013-06-11 22:40 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: xfs, linux-kernel

On Tue, Jun 11, 2013 at 11:17:35AM -0500, Shawn Bohrer wrote:
> In the workload I've been debugging we append data to many small files
> using mmap.  The writes are small and the total data rate is very low,
> so for most files it may take several minutes to fill a page.
> Low-latency writes are important, but as you know stalls are always
> possible.  One way to reduce the probability of a stall is to reduce
> the frequency of writeback, and adjusting vm.dirty_expire_centisecs
> and/or vm.dirty_writeback_centisecs should allow us to do that.
> 
> On kernels 3.4 and older we chose to increase
> vm.dirty_expire_centisecs to 30000 since we can comfortably lose 5
> minutes of data in the event of a system failure, and we believed this
> would cause a fairly consistent low data rate as every
> vm.dirty_writeback_centisecs (5s) it would write back all dirty pages
> that were vm.dirty_expire_centisecs (5min) old.

I'm not surprised that behaviour in the VM has changed: delaying
flusher thread execution for up to 5 minutes on dirty inodes means
that memory pressure is determining your writeback patterns.

> On old kernels that
> isn't exactly what happened.  Instead, every 5 minutes there would be
> a burst of writeback and a slow trickle at all other times.  This also
> reduced the total amount of data written back, since the same dirty
> page was no longer written back every 30 seconds, and it virtually
> eliminated the stalls we saw, so the setting was left alone.

So the slow trickle is memory-pressure-driven writeback....

> On 3.10, vm.dirty_expire_centisecs=30000 no longer does the same
> thing.  Honestly I'm not sure what it does, but the result is a fairly
> consistent high data rate being written back to disk.  The fact that
> it is consistent might lead me to believe that it writes back all
> pages that are vm.dirty_expire_centisecs old every
> vm.dirty_writeback_centisecs, but the data rate is far too high for
> that to be true.  It appears that I can effectively get the same old
> behavior by setting vm.dirty_writeback_centisecs=30000.

You probably need to do some tracepoint analysis to find out what is
triggering the writeback. Once we have an idea of the writeback
trigger, we might be able to explain the difference in behaviour.
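
Something along these lines should be enough to capture it (a sketch;
paths assume debugfs is mounted at /sys/kernel/debug):

    cd /sys/kernel/debug/tracing
    echo 1 > events/writeback/enable     # all writeback tracepoints
    cat trace_pipe > /tmp/writeback.log  # the reason= field shows why each work item was queued

Matching those entries against the timestamps of your stalls should
show whether the writeback is periodic, background, or something else
entirely.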

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

