* Ceph Crash at sync_thread_timeout after heavy random writes.
@ 2013-03-25  9:01 Chen, Xiaoxi
  2013-03-25 15:34 ` [ceph-users] " Sage Weil
  0 siblings, 1 reply; 4+ messages in thread
From: Chen, Xiaoxi @ 2013-03-25  9:01 UTC (permalink / raw)
  To: 'ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org'
	(ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org)
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA



Hi list,
         We have hit and reproduced this issue several times: ceph-osd will commit suicide because FileStore: sync_entry timed out after very heavy random IO on top of RBD.
         My test environment is:
                            A 4-node Ceph cluster with 20 HDDs for OSDs and 4 Intel DC S3700 SSDs for journals per node, i.e. 80 spindles in total
                            48 VMs spread across 12 physical nodes, with 48 RBDs attached to the VMs 1:1 via QEMU
                            Ceph @ 0.58
                            XFS was used.
         I am using Aiostress (something like FIO) to produce random write requests on top of each RBD.

         From ceph -w, Ceph reports a very high op rate (10,000+ ops/s), but in theory 80 spindles can provide at most 150*80/2 = 6000 IOPS for 4K random writes.
         When digging into the code, I found that the OSD writes data to the page cache and then returns. Although it calls ::sync_file_range, that syscall does not actually sync data to disk before it returns; it is an asynchronous call. So the situation is: random writes are extremely fast because they only hit the journal and the page cache, but once a sync starts it takes a very long time. Since the speed gap between the journal and the OSD data disks persists, the amount of data that needs to be synced keeps increasing, and the sync will eventually exceed 600 s.
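         To make the point concrete, here is a minimal sketch (my own illustration, not the actual FileStore code; the function names are made up) of why the buffered path looks so fast: the SYNC_FILE_RANGE_WRITE flag only initiates writeback and returns immediately, while durability is deferred to a later blocking ::syncfs():

    // Illustration only (assumed names, not Ceph source); glibc/Linux,
    // compiled as C++ where _GNU_SOURCE is predefined.
    #include <fcntl.h>      // ::sync_file_range, SYNC_FILE_RANGE_WRITE
    #include <unistd.h>     // ::pwrite, ::syncfs

    void buffered_write(int fd, const char *buf, size_t len, off_t off) {
        ::pwrite(fd, buf, len, off);                             // data lands in the page cache only
        ::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);  // merely queues writeback and returns
    }

    void commit(int any_fd_on_the_fs) {
        ::syncfs(any_fd_on_the_fs);  // blocks until every dirty page of that filesystem is on disk
    }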

         For more information: I have tried to reproduce this with rados bench, but failed.

         Could you please let me know if you need any more information, and whether you have a solution? Thanks.
                                                                           Xiaoxi



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
  2013-03-25  9:01 Ceph Crash at sync_thread_timeout after heavy random writes Chen, Xiaoxi
@ 2013-03-25 15:34 ` Sage Weil
  2013-03-26  0:58   ` Chen, Xiaoxi
  0 siblings, 1 reply; 4+ messages in thread
From: Sage Weil @ 2013-03-25 15:34 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com),
	ceph-devel


Hi Xiaoxi,

On Mon, 25 Mar 2013, Chen, Xiaoxi wrote:
>          From Ceph-w , ceph reports a very high Ops (10000+ /s) , but
> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K random
> write.
> 
>          When digging into the code, I found that the OSD write data to
> Pagecache than returned, although it called ::sync_file_range, but this
> syscall doesn't actually sync data to disk when it returns, it's an async call.
> So the situation is , the random write will be extremely fast since it only
> write to journal and pagecache, but once syncing , it will take very long
> time. The speed gap between journal and OSDs exist, the amount of data that
> need to be sync keep increasing, and it will certainly exceed 600s.

The sync_file_range is only there to push things to disk sooner, so that 
the eventual syncfs(2) takes less time.  When the async flushing is 
enabled, there is a limit to the number of flushes that are in the queue, 
but if it hits the max it just does

    dout(10) << "queue_flusher ep " << sync_epoch << " fd " << fd << " " << off << "~" << len
	     << " qlen " << flusher_queue_len 
	     << " hit flusher_max_fds " << m_filestore_flusher_max_fds
	     << ", skipping async flush" << dendl;

Can you confirm that the filestore is taking this path?  (debug filestore 
= 10 and then reproduce.)

You may want to try

 filestore flusher = false
 filestore sync flush = true

and see if that changes things--it will make the sync_file_range() happen 
inline after the write.
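(For reference, a sketch of how those options might be laid out in ceph.conf; putting them under [osd] is my assumption, adjust as needed:)

    [osd]
        debug filestore = 10          ; to catch the "skipping async flush" message above
        filestore flusher = false     ; bypass the async flusher queue
        filestore sync flush = true   ; do sync_file_range() inline after each write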

Anyway, it sounds like you may be queueing up so many random writes that 
the sync takes forever.  I've never actually seen that happen, so if we 
can confirm that's what is going on that will be very interesting.

Thanks-
sage


> 
>  
> 
>          For more information, I have tried to reproduce this by rados
> bench,but failed.
> 
>  
> 
>          Could you please let me know if you need any more informations &
> have some solutions? Thanks
> 
>                                                                           Xiaoxi
> 
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
  2013-03-25 15:34 ` [ceph-users] " Sage Weil
@ 2013-03-26  0:58   ` Chen, Xiaoxi
  0 siblings, 0 replies; 4+ messages in thread
From: Chen, Xiaoxi @ 2013-03-26  0:58 UTC (permalink / raw)
  To: Sage Weil
  Cc: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com),
	ceph-devel

Hi Sage,
	Thanks for your mail. When I turn on filestore sync flush, it seems to work and the OSD process no longer commits suicide. I had disabled the flusher long ago, since both Mark's report and mine showed that disabling it seems to improve performance (so my original configuration was filestore_flusher = false, filestore_sync_flush = false (the default)), but now we have to reconsider this. I would like to read the kernel implementation of ::sync_file_range() to learn more about how it works. My first guess is that ::sync_file_range pushes requests to the disk queue and, if the disk queue is full, blocks and waits, but I am not sure.

	But looking at the code path (by the way, these lines of code are a bit hard to follow):
			if (!should_flush || !m_filestore_flusher || !queue_flusher(fd, offset, len)) {
				if (should_flush && m_filestore_sync_flush)
					::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
				lfn_close(fd);
			}
	With the default setting (m_filestore_flusher = true), the flusher queue will soon fill up. In that situation, if the user does not also turn on m_filestore_sync_flush = true, he/she will likely hit the same problem: writes remain in the page cache and the OSD daemon dies when trying to sync. I suppose the right logic should be (pseudocode):
			if (should_flush) {
				if (m_filestore_flusher) {
					if (queue_flusher(fd, offset, len))
						; // do nothing, the flusher will handle it
					else
						::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
				} else if (m_filestore_sync_flush) {
					::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
				}
				lfn_close(fd);
			}

																Xiaoxi
-----Original Message-----
From: Sage Weil [mailto:sage@inktank.com] 
Sent: Monday, March 25, 2013 23:35
To: Chen, Xiaoxi
Cc: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com); ceph-devel@vger.kernel.org
Subject: Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

Hi Xiaoxi,

On Mon, 25 Mar 2013, Chen, Xiaoxi wrote:
>          From Ceph-w , ceph reports a very high Ops (10000+ /s) , but 
> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K 
> random write.
> 
>          When digging into the code, I found that the OSD write data 
> to Pagecache than returned, although it called ::sync_file_range, but 
> this syscall doesn't actually sync data to disk when it returns, it's an async call.
> So the situation is , the random write will be extremely fast since it 
> only write to journal and pagecache, but once syncing , it will take 
> very long time. The speed gap between journal and OSDs exist, the 
> amount of data that need to be sync keep increasing, and it will certainly exceed 600s.

The sync_file_range is only there to push things to disk sooner, so that the eventual syncfs(2) takes less time.  When the async flushing is enabled, there is a limit to the number of flushes that are in the queue, but if it hits the max it just does

    dout(10) << "queue_flusher ep " << sync_epoch << " fd " << fd << " " << off << "~" << len
	     << " qlen " << flusher_queue_len 
	     << " hit flusher_max_fds " << m_filestore_flusher_max_fds
	     << ", skipping async flush" << dendl;

Can you confirm that the filestore is taking this path?  (debug filestore = 10 and then reproduce.)

You may want to try

 filestore flusher = false
 filestore sync flush = true

and see if that changes things--it will make the sync_file_range() happen inline after the write.

Anyway, it sounds like you may be queueing up so many random writes that the sync takes forever.  I've never actually seen that happen, so if we can confirm that's what is going on that will be very interesting.

Thanks-
sage


> 
>  
> 
>          For more information, I have tried to reproduce this by rados 
> bench,but failed.
> 
>  
> 
>          Could you please let me know if you need any more 
> informations & have some solutions? Thanks
> 
>                                                                           Xiaoxi
> 
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Ceph Crash at sync_thread_timeout after heavy random writes.
@ 2013-03-25 13:59 Chen, Xiaoxi
  0 siblings, 0 replies; 4+ messages in thread
From: Chen, Xiaoxi @ 2013-03-25 13:59 UTC (permalink / raw)
  To: Chen, Xiaoxi,
	'ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org'
	(ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org)
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA

Rephrasing it to make it clearer:

From: ceph-users-bounces@lists.ceph.com [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Chen, Xiaoxi
Sent: Monday, March 25, 2013 17:02
To: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)
Cc: ceph-devel@vger.kernel.org
Subject: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

Hi list,
         We have hit and reproduced this issue several times: ceph-osd will commit suicide because FileStore: sync_entry timed out after very heavy random IO on top of RBD.
         My test environment is:
                            A 4-node Ceph cluster with 20 HDDs for OSDs and 4 Intel DC S3700 SSDs for journals per node, i.e. 80 spindles in total
                            48 VMs spread across 12 physical nodes, with 48 RBDs attached to the VMs 1:1 via QEMU, and the QEMU cache disabled
                            Ceph @ 0.58
                            XFS was used.
         I am running Aiostress (something like FIO) inside the VMs to produce random write requests on top of each RBD.

         From ceph -w, Ceph reports a very high op rate (10,000+ ops/s), but in theory 80 spindles can provide at most 150*80/2 = 6000 IOPS for 4K random writes.
         When digging into the code, it is clear from FileStore.cc::_write() that the OSD opens object files without O_DIRECT, which means data writes are buffered by the page cache and then return. Although ::sync_file_range is called, with the flag SYNC_FILE_RANGE_WRITE that system call does not actually sync data to disk before it returns; it merely initiates the write-out IOs.
                    So the situation is: since all writes just go to the page cache, the backend OSD data disks **seem** extremely fast for random writes, which is why we see such a high op rate from ceph -w. However, when the OSD sync thread tries to sync the FS it uses ::syncfs(), and before ::syncfs returns the OS has to ensure that all dirty pages in the page cache (belonging to that particular FS) have been written to disk. This obviously takes a long time, and you can only expect on the order of 100 IOPS per spindle from a non-btrfs filesystem. The performance gap is there: an SSD journal can do 4K random writes at 1K+ IOPS, but the 4 HDDs journaled by the same SSD can only provide about 400 IOPS.
With the random write pressure continuing, the amount of dirty pages in the page cache keeps increasing; sooner or later ::syncfs() cannot return within 600 s (the default value of filestore_commit_timeout), which triggers the assert and makes the ceph-osd process commit suicide.

   I have tried to reproduce this with rados bench, but failed, because rados bench **creates** objects rather than modifying them, so a batch of creates can be merged into a single big write. So I assume that anyone who wants to reproduce this issue has to use QEMU or the kernel client, with a fast journal (say, tmpfs) and a slow data disk; choosing a small filestore_commit_timeout may also help reproduce this issue in a small-scale environment, as sketched below.
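   (A sketch of the kind of reproduction settings meant above, purely illustrative: the journal path is hypothetical and the timeout value is just an example.)

    [osd]
        osd journal = /dev/shm/osd.$id.journal   ; fast tmpfs-backed journal (hypothetical path)
        filestore commit timeout = 60            ; far below the 600 s default, so the timeout assert fires sooner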

         Could you please let me know if you need any more information, and whether you have a solution? Thanks.
                                                                           Xiaoxi

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2013-03-26  0:58 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-25  9:01 Ceph Crash at sync_thread_timeout after heavy random writes Chen, Xiaoxi
2013-03-25 15:34 ` [ceph-users] " Sage Weil
2013-03-26  0:58   ` Chen, Xiaoxi
2013-03-25 13:59 Chen, Xiaoxi
