All of lore.kernel.org
 help / color / mirror / Atom feed
* Slow request warnings on 0.48
@ 2012-07-04 16:53 David Blundell
  2012-07-04 16:58 ` Alexandre DERUMIER
  0 siblings, 1 reply; 12+ messages in thread
From: David Blundell @ 2012-07-04 16:53 UTC (permalink / raw)
  To: ceph-devel

I have three servers running mon and osd using Ubuntu 12.04 that I have been testing with RADOS storing RBD KVM instances

0.47.3 worked extremely well (once I got over a few btrfs issues).  The same servers running 0.48 give a large number of "[WRN] slow request" messages whenever I generate a lot of random IO in the KVM instances using iozone.  The slow responses eventually leads to disk timeouts on the KVM instances.

I have erased the osds and recreated on new btrfs volumes with the same result.

I have also tried switching to xfs using mkfs.xfs -n size=64k with noatime, inode64,delaylog,logbufs=8,logbsize=256k

Xfs gives the same result - the iozone tests run fine until the random IO starts and then there are lots of slow request warnings.

Does anyone have any ideas about the best place to start troubleshooting / debugging?

Thanks,

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-04 16:53 Slow request warnings on 0.48 David Blundell
@ 2012-07-04 16:58 ` Alexandre DERUMIER
  2012-07-04 18:59   ` Gregory Farnum
  2012-07-05 17:58   ` Mark Nelson
  0 siblings, 2 replies; 12+ messages in thread
From: Alexandre DERUMIER @ 2012-07-04 16:58 UTC (permalink / raw)
  To: David Blundell; +Cc: ceph-devel

Hi, I see same messages here after upgrade to 0.48.

with random write benchmark.

I have more lags than before with 0.47 (but disks are at 100% usage, so can't tell if it's normal or not)


----- Mail original ----- 

De: "David Blundell" <David.Blundell@100percentit.com> 
À: ceph-devel@vger.kernel.org 
Envoyé: Mercredi 4 Juillet 2012 18:53:02 
Objet: Slow request warnings on 0.48 

I have three servers running mon and osd using Ubuntu 12.04 that I have been testing with RADOS storing RBD KVM instances 

0.47.3 worked extremely well (once I got over a few btrfs issues). The same servers running 0.48 give a large number of "[WRN] slow request" messages whenever I generate a lot of random IO in the KVM instances using iozone. The slow responses eventually leads to disk timeouts on the KVM instances. 

I have erased the osds and recreated on new btrfs volumes with the same result. 

I have also tried switching to xfs using mkfs.xfs -n size=64k with noatime, inode64,delaylog,logbufs=8,logbsize=256k 

Xfs gives the same result - the iozone tests run fine until the random IO starts and then there are lots of slow request warnings. 

Does anyone have any ideas about the best place to start troubleshooting / debugging? 

Thanks, 

David 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 



-- 

-- 



	

Alexandre D e rumier 

Ingénieur Systèmes et Réseaux 


Fixe : 03 20 68 88 85 

Fax : 03 20 68 90 88 


45 Bvd du Général Leclerc 59100 Roubaix 
12 rue Marivaux 75002 Paris 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-04 16:58 ` Alexandre DERUMIER
@ 2012-07-04 18:59   ` Gregory Farnum
  2012-07-04 22:03     ` David Blundell
  2012-07-05 17:58   ` Mark Nelson
  1 sibling, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2012-07-04 18:59 UTC (permalink / raw)
  To: Alexandre DERUMIER, David Blundell; +Cc: ceph-devel

That's odd — there isn't too much that went into the OSD between 0.47 and 0.48 that I can think of, and most of that only impact OSDs when they go through bootup. What does ceph -s display — are all the PGs healthy?  
-Greg


On Wednesday, July 4, 2012 at 9:58 AM, Alexandre DERUMIER wrote:

> Hi, I see same messages here after upgrade to 0.48.
>  
> with random write benchmark.
>  
> I have more lags than before with 0.47 (but disks are at 100% usage, so can't tell if it's normal or not)
>  
>  
> ----- Mail original -----  
>  
> De: "David Blundell" <David.Blundell@100percentit.com (mailto:David.Blundell@100percentit.com)>  
> À: ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)  
> Envoyé: Mercredi 4 Juillet 2012 18:53:02  
> Objet: Slow request warnings on 0.48  
>  
> I have three servers running mon and osd using Ubuntu 12.04 that I have been testing with RADOS storing RBD KVM instances  
>  
> 0.47.3 worked extremely well (once I got over a few btrfs issues). The same servers running 0.48 give a large number of "[WRN] slow request" messages whenever I generate a lot of random IO in the KVM instances using iozone. The slow responses eventually leads to disk timeouts on the KVM instances.  
>  
> I have erased the osds and recreated on new btrfs volumes with the same result.  
>  
> I have also tried switching to xfs using mkfs.xfs -n size=64k with noatime, inode64,delaylog,logbufs=8,logbsize=256k  
>  
> Xfs gives the same result - the iozone tests run fine until the random IO starts and then there are lots of slow request warnings.  
>  
> Does anyone have any ideas about the best place to start troubleshooting / debugging?  
>  
> Thanks,  
>  
> David  
> --  
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in  
> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)  
> More majordomo info at http://vger.kernel.org/majordomo-info.html  
>  
>  
>  
> --  
>  
> --  
>  
>  
>  
>  
> Alexandre D e rumier  
>  
> Ingénieur Systèmes et Réseaux  
>  
>  
> Fixe : 03 20 68 88 85  
>  
> Fax : 03 20 68 90 88  
>  
>  
> 45 Bvd du Général Leclerc 59100 Roubaix  
> 12 rue Marivaux 75002 Paris  
>  
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-04 18:59   ` Gregory Farnum
@ 2012-07-04 22:03     ` David Blundell
  0 siblings, 0 replies; 12+ messages in thread
From: David Blundell @ 2012-07-04 22:03 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Alexandre DERUMIER, <ceph-devel@vger.kernel.org>

On 4 Jul 2012, at 19:59, Gregory Farnum wrote:

> That's odd — there isn't too much that went into the OSD between 0.47 and 0.48 that I can think of, and most of that only impact OSDs when they go through bootup. What does ceph -s display — are all the PGs healthy?  
> -Greg
> 


Hi Greg,

The PGs all seem to be healthy:

root@store1:~# ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at {0=10.0.1.40:6789/0,1=10.0.1.41:6789/0,2=10.0.1.42:6789/0}, election epoch 40, quorum 0,1,2 0,1,2
   osdmap e342: 7 osds: 7 up, 7 in
    pgmap v5403: 1344 pgs: 1344 active+clean; 4620 MB data, 9617 MB used, 1368 GB / 1377 GB avail
   mdsmap e50: 0/0/1 up

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-04 16:58 ` Alexandre DERUMIER
  2012-07-04 18:59   ` Gregory Farnum
@ 2012-07-05 17:58   ` Mark Nelson
  2012-07-05 18:16     ` David Blundell
                       ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Mark Nelson @ 2012-07-05 17:58 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: David Blundell, ceph-devel

On 07/04/2012 11:58 AM, Alexandre DERUMIER wrote:
> Hi, I see same messages here after upgrade to 0.48.
>
> with random write benchmark.
>
> I have more lags than before with 0.47 (but disks are at 100% usage, so can't tell if it's normal or not)
>
>
> ----- Mail original -----
>
> De: "David Blundell"<David.Blundell@100percentit.com>
> À: ceph-devel@vger.kernel.org
> Envoyé: Mercredi 4 Juillet 2012 18:53:02
> Objet: Slow request warnings on 0.48
>
> I have three servers running mon and osd using Ubuntu 12.04 that I have been testing with RADOS storing RBD KVM instances
>
> 0.47.3 worked extremely well (once I got over a few btrfs issues). The same servers running 0.48 give a large number of "[WRN] slow request" messages whenever I generate a lot of random IO in the KVM instances using iozone. The slow responses eventually leads to disk timeouts on the KVM instances.
>
> I have erased the osds and recreated on new btrfs volumes with the same result.
>
> I have also tried switching to xfs using mkfs.xfs -n size=64k with noatime, inode64,delaylog,logbufs=8,logbsize=256k
>
> Xfs gives the same result - the iozone tests run fine until the random IO starts and then there are lots of slow request warnings.
>
> Does anyone have any ideas about the best place to start troubleshooting / debugging?
>
> Thanks,
>
> David

Hi David and Alexandre,

Does this only happen with random writes or also sequential writes?  If 
it happens with sequential writes as well, does it happen with rados bench?

-- 
Mark Nelson
Performance Engineer
Inktank
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Slow request warnings on 0.48
  2012-07-05 17:58   ` Mark Nelson
@ 2012-07-05 18:16     ` David Blundell
  2012-07-05 18:33     ` Alexandre DERUMIER
  2012-07-05 18:43     ` David Blundell
  2 siblings, 0 replies; 12+ messages in thread
From: David Blundell @ 2012-07-05 18:16 UTC (permalink / raw)
  To: Mark Nelson, Alexandre DERUMIER; +Cc: ceph-devel

> Hi David and Alexandre,
> 
> Does this only happen with random writes or also sequential writes?  If it
> happens with sequential writes as well, does it happen with rados bench?
> 
> --
> Mark Nelson
> Performance Engineer
> Inktank

Hi Mark,

I have only ever seen it with random writes.  I'll retry rados bench in a few minutes to double check - are there any other tests you would like me to run?

I'm currently waiting for some iozone tests to finish.  When the sequential tests are running the logs are fine, as soon as the random writes start, the logs start to fill with messages like:

2012-07-05 19:10:00.599250 osd.6 10.0.1.42:6802/2145 1151 : [WRN] slow request 37.933071 seconds old, received at 2012-07-05 19:09:22.665917: osd_op(client.96416.0:91965 rb.0.1.00000000015f [write 4022272~4096] 2.3777e91a snapc 11=[11,10]) v4 currently waiting for sub ops
2012-07-05 19:10:00.599258 osd.6 10.0.1.42:6802/2145 1152 : [WRN] slow request 37.932836 seconds old, received at 2012-07-05 19:09:22.666152: osd_op(client.96416.0:91966 rb.0.1.00000000015f [write 4034560~4096] 2.3777e91a snapc 11=[11,10]) v4 currently waiting for sub ops
2012-07-05 19:10:03.278141 mon.0 10.0.1.40:6789/0 493 : [INF] pgmap v7564: 1344 pgs: 1344 active+clean; 5183 MB data, 11066 MB used, 1367 GB / 1377 GB avail
2012-07-05 19:09:55.388448 osd.3 10.0.1.41:6802/2540 160 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 32.622016 secs
2012-07-05 19:09:55.388463 osd.3 10.0.1.41:6802/2540 161 : [WRN] slow request 32.622016 seconds old, received at 2012-07-05 19:09:22.766269: osd_op(client.96416.0:92308 rb.0.1.00000000017b [write 4001792~4096] 2.f606a6c6 snapc 11=[11,10]) v4 currently waiting for sub ops	

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-05 17:58   ` Mark Nelson
  2012-07-05 18:16     ` David Blundell
@ 2012-07-05 18:33     ` Alexandre DERUMIER
  2012-07-05 18:43     ` David Blundell
  2 siblings, 0 replies; 12+ messages in thread
From: Alexandre DERUMIER @ 2012-07-05 18:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: David Blundell, ceph-devel

It was during a random write (fio benchmark).

I can't reproduce it now,I'll try to do tests again this week.

----- Mail original ----- 

De: "Mark Nelson" <mark.nelson@inktank.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "David Blundell" <David.Blundell@100percentit.com>, ceph-devel@vger.kernel.org 
Envoyé: Jeudi 5 Juillet 2012 19:58:27 
Objet: Re: Slow request warnings on 0.48 

On 07/04/2012 11:58 AM, Alexandre DERUMIER wrote: 
> Hi, I see same messages here after upgrade to 0.48. 
> 
> with random write benchmark. 
> 
> I have more lags than before with 0.47 (but disks are at 100% usage, so can't tell if it's normal or not) 
> 
> 
> ----- Mail original ----- 
> 
> De: "David Blundell"<David.Blundell@100percentit.com> 
> À: ceph-devel@vger.kernel.org 
> Envoyé: Mercredi 4 Juillet 2012 18:53:02 
> Objet: Slow request warnings on 0.48 
> 
> I have three servers running mon and osd using Ubuntu 12.04 that I have been testing with RADOS storing RBD KVM instances 
> 
> 0.47.3 worked extremely well (once I got over a few btrfs issues). The same servers running 0.48 give a large number of "[WRN] slow request" messages whenever I generate a lot of random IO in the KVM instances using iozone. The slow responses eventually leads to disk timeouts on the KVM instances. 
> 
> I have erased the osds and recreated on new btrfs volumes with the same result. 
> 
> I have also tried switching to xfs using mkfs.xfs -n size=64k with noatime, inode64,delaylog,logbufs=8,logbsize=256k 
> 
> Xfs gives the same result - the iozone tests run fine until the random IO starts and then there are lots of slow request warnings. 
> 
> Does anyone have any ideas about the best place to start troubleshooting / debugging? 
> 
> Thanks, 
> 
> David 

Hi David and Alexandre, 

Does this only happen with random writes or also sequential writes? If 
it happens with sequential writes as well, does it happen with rados bench? 

-- 
Mark Nelson 
Performance Engineer 
Inktank 



-- 

-- 



	

Alexandre D e rumier 

Ingénieur Systèmes et Réseaux 


Fixe : 03 20 68 88 85 

Fax : 03 20 68 90 88 


45 Bvd du Général Leclerc 59100 Roubaix 
12 rue Marivaux 75002 Paris 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Slow request warnings on 0.48
  2012-07-05 17:58   ` Mark Nelson
  2012-07-05 18:16     ` David Blundell
  2012-07-05 18:33     ` Alexandre DERUMIER
@ 2012-07-05 18:43     ` David Blundell
  2012-07-05 19:49       ` Mark Nelson
  2 siblings, 1 reply; 12+ messages in thread
From: David Blundell @ 2012-07-05 18:43 UTC (permalink / raw)
  To: Mark Nelson, Alexandre DERUMIER; +Cc: ceph-devel

> Hi David and Alexandre,
> 
> Does this only happen with random writes or also sequential writes?  If it
> happens with sequential writes as well, does it happen with rados bench?
> 
> --
> Mark Nelson
> Performance Engineer
> Inktank

Hi Mark,

I just ran "rados -p data bench 60 write -t 16" and a few dd tests with no problems at all so at the moment it looks like only random IO triggers the slow writes.

Please do let me know if there are any other tests that I can do to help track down the cause.

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-05 18:43     ` David Blundell
@ 2012-07-05 19:49       ` Mark Nelson
  2012-07-05 20:21         ` Samuel Just
  0 siblings, 1 reply; 12+ messages in thread
From: Mark Nelson @ 2012-07-05 19:49 UTC (permalink / raw)
  To: David Blundell; +Cc: Alexandre DERUMIER, ceph-devel

On 07/05/2012 01:43 PM, David Blundell wrote:
>> Hi David and Alexandre,
>>
>> Does this only happen with random writes or also sequential writes?  If it
>> happens with sequential writes as well, does it happen with rados bench?
>>
>> --
>> Mark Nelson
>> Performance Engineer
>> Inktank
>
> Hi Mark,
>
> I just ran "rados -p data bench 60 write -t 16" and a few dd tests with no problems at all so at the moment it looks like only random IO triggers the slow writes.
>
> Please do let me know if there are any other tests that I can do to help track down the cause.
>
> David

Thanks David!  We've got some people internally taking a look at this. 
I'll let you guys know if there is anything else we need!

Thanks,
Mark

-- 
Mark Nelson
Performance Engineer
Inktank

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-05 19:49       ` Mark Nelson
@ 2012-07-05 20:21         ` Samuel Just
  2012-07-05 22:15           ` David Blundell
  0 siblings, 1 reply; 12+ messages in thread
From: Samuel Just @ 2012-07-05 20:21 UTC (permalink / raw)
  To: Mark Nelson; +Cc: David Blundell, Alexandre DERUMIER, ceph-devel

David,

Could you try rados -p data bench 60 write -t 16 -b 4096?

rados bench defaults to 4MB objects, this'll give us results for 4k objects.

If you could give me the latency too, that would help.
-Sam

On Thu, Jul 5, 2012 at 12:49 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
> On 07/05/2012 01:43 PM, David Blundell wrote:
>>>
>>> Hi David and Alexandre,
>>>
>>> Does this only happen with random writes or also sequential writes?  If
>>> it
>>> happens with sequential writes as well, does it happen with rados bench?
>>>
>>> --
>>> Mark Nelson
>>> Performance Engineer
>>> Inktank
>>
>>
>> Hi Mark,
>>
>> I just ran "rados -p data bench 60 write -t 16" and a few dd tests with no
>> problems at all so at the moment it looks like only random IO triggers the
>> slow writes.
>>
>> Please do let me know if there are any other tests that I can do to help
>> track down the cause.
>>
>> David
>
>
> Thanks David!  We've got some people internally taking a look at this. I'll
> let you guys know if there is anything else we need!
>
> Thanks,
> Mark
>
>
> --
> Mark Nelson
> Performance Engineer
> Inktank
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
  2012-07-05 20:21         ` Samuel Just
@ 2012-07-05 22:15           ` David Blundell
  0 siblings, 0 replies; 12+ messages in thread
From: David Blundell @ 2012-07-05 22:15 UTC (permalink / raw)
  To: Samuel Just; +Cc: Mark Nelson, Alexandre DERUMIER, ceph-devel


On 5 Jul 2012, at 21:21, Samuel Just wrote:

> David,
> 
> Could you try rados -p data bench 60 write -t 16 -b 4096?
> 
> rados bench defaults to 4MB objects, this'll give us results for 4k objects.
> 
> If you could give me the latency too, that would help.
> -Sam

Hi Sam,

I first ran this with the standard ceph settings giving http://pastebin.com/MWLxEazS

This did not cause any slow request warnings so I set "filestore queue max ops = 5000" to increase the number of requests in flight.  This resulted in http://pastebin.com/yFnALmGW and also a small number of slow request warnings.  I ran it again with similar results http://pastebin.com/VnKSVmsq

If there's anything else you need, please let me know.

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Slow request warnings on 0.48
@ 2012-07-19 10:48 Matthew Richardson
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Richardson @ 2012-07-19 10:48 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1620 bytes --]

I'd just like to report the same behaviour on my test cluster with 0.48.

I've set up a single box (Sl6.1 - 2.6.32-220.23.1 kernel) with 1 mds,
mon and osd, and replication set to '1' for both data and metadata.

Having mounted using ceph-fuse, I'm running a simple fio job to create load:

[global]
directory=/mnt/ceph
size=500M
rw=read
ioengine=libaio

[simple]

I'm then watching the latency with ioping.

With rw=read, rw=randread (random reads) or rw=write (sequential writes)
I see no problems and the latency sits around 1-2ms.  However, with
rw=randwrite (random writes) I see the latency jump to between 5 and 60
seconds, and the following types of warning lines appear:

2012-07-19 10:29:39.417625 osd.0 [WRN] 11 slow requests, 6 included
below; oldest blocked for > 54.425766 secs
[WRN] slow request 54.420958 seconds old, received at 2012-07-19
10:28:44.996584: osd_op(client.4113.0:9153 100000003ed.0000003b [write
847872~4096] 0.dc4b476f snapc 1=[]) v4 currently started
2012-07-19 10:29:39.417641 osd.0 [WRN] slow request 54.420587 seconds
old, received at 2012-07-19 10:28:44.996955: osd_op(client.4113.0:9154
100000003ed.00000000 [write 1175552~4096] 0.44a7cb80 snapc 1=[]) v4
currently started
[...snip...]


Let me know if there's any more information that I can provide that
might help with diagnosing the problem (also bearing in mind that I'm
new to ceph so might need extra notes on generating tests, dumps etc :) )

Thanks,

Matthew


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-07-19 11:33 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-04 16:53 Slow request warnings on 0.48 David Blundell
2012-07-04 16:58 ` Alexandre DERUMIER
2012-07-04 18:59   ` Gregory Farnum
2012-07-04 22:03     ` David Blundell
2012-07-05 17:58   ` Mark Nelson
2012-07-05 18:16     ` David Blundell
2012-07-05 18:33     ` Alexandre DERUMIER
2012-07-05 18:43     ` David Blundell
2012-07-05 19:49       ` Mark Nelson
2012-07-05 20:21         ` Samuel Just
2012-07-05 22:15           ` David Blundell
2012-07-19 10:48 Matthew Richardson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.