* Degraded PGs blocking open()?
From: Székelyi Szabolcs @ 2011-06-07  0:37 UTC
  To: ceph-devel

Hi all,

I have a three node ceph setup, two nodes playing all three roles (OSD, MDS, 
MON), and one being just a monitor (which happens to be the client I'm using 
the filesystem from).

I want to achieve high availability by mirroring all data between the OSDs and 
being able to still access everything even if one of them goes down. The 
mirroring works fine, I see the space being consumed on both nodes as I copy 
data on the file system. According to `ceph -s`, all PGs are in active+clean 
state. If I start reading a big file and shut down one of the (OSD+MDS+MON) 
nodes, the file can still be read until the end, that's fine. Moreover, the 
contents read back seem correct when compared to the original file. Very nice. 
But if I start reading the file while one of the nodes is down, it blocks until 
the node comes up again. I can't even kill the reading process with KILL, 
TERM, or INT.

Am I doing something wrong, was I not careful enough reading the docs, or might 
this be a bug? My ceph.conf is attached.

Thanks,
-- 
cc


[-- Attachment #2: ceph.conf --]

[global]
auth supported = cephx
keyring = /etc/ceph/keyring.$name



[mds]

[mds.0]
host = iscsigw1

[mds.1]
host = iscsigw2



[osd]
osd data = /srv/ceph/osd.$id

[osd.0]
host = iscsigw1

[osd.1]
host = iscsigw2



[mon]
mon data = /srv/ceph/mon.$id

[mon.0]
host = iscsigw1
mon addr = <node1_ip>:6789

[mon.1]
host = iscsigw2
mon addr = <node2_ip>:6789

[mon.cc]
host = cc
mon addr = <node3_ip>:6789

* Re: Degraded PGs blocking open()?
From: Gregory Farnum @ 2011-06-07  2:15 UTC
  To: Székelyi Szabolcs; +Cc: ceph-devel

2011/6/6 Székelyi Szabolcs <szekelyi@niif.hu>:
> Hi all,
>
> I have a three node ceph setup, two nodes playing all three roles (OSD, MDS,
> MON), and one being just a monitor (which happens to be the client I'm using
> the filesystem from).
>
> I want to achieve high availability by mirroring all data between the OSDs and
> being able to still access everything even if one of them goes down. The
> mirroring works fine, I see the space being consumed on both nodes as I copy
> data on the file system. According to `ceph -s`, all PGs are in active+clean
> state. If I start reading a big file and shut down one of the (OSD+MDS+MON)
> nodes, the file can still be read until the end, that's fine. Moreover, the
> contents read back seem correct when compared to the original file. Very nice.
> But if I start reading the file while one of the nodes is down, it blocks until
> the node comes up again. I can't even kill the reading process with KILL,
> TERM, or INT.
>
> Am I doing something wrong, was I not careful enough reading the docs, or
> might this be a bug? My ceph.conf is attached.
The problem isn't in the OSD, it's the MDS. :)

The MDS system is *slightly* less resilient than the OSD system is.
You can set up "standby" MDSes that will take over if the system
detects that an MDS has died; you can even set up "standby-replay"
MDSes that follow a specific MDS and keep all its data cached in
memory so they can take over right when a failure is detected. But if
you lose one MDS its data won't automatically be imported into the
remaining MDSes. (Because the MDS keeps all its data on the OSDs,
there's no danger of losing data -- it's a matter of how the data is
segregated that requires a new daemon. And generally the process is
dominated by the timeout, not the time it takes the new MDS to take
over.)
So in your case, you're trying to open a file that is controlled by
the MDS that you killed, and the client can't get the "capability"
bits that it needs in order to look at the file. So you've got a few
options:
1) Kill the OSD, but not the MDS.
2) Create an extra MDS daemon, perhaps on your monitor node. When the
system detects that one of your MDSes has died (a configurable
timeout, IIRC in the neighborhood of 30-60 seconds), this extra daemon
will take over; a config sketch follows below.
(Or you can just start up the new daemon after you kill the old one;
it doesn't matter.)
3) Create a new system with only one MDS and don't kill that one.
(Eventually you will be able to shrink the number of MDSes, but this
isn't well-tested or documented so I'm not sure what state it's in
right now.)

I recommend option 2 for maximum wow. ;)
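
For concreteness, a minimal sketch of what option 2 could look like in the
ceph.conf you attached -- the section name mds.cc and putting it on your
monitor host "cc" are just illustration, use whatever fits:

[mds.cc]
host = cc

With max_mds left where it is, this third daemon will sit around as a standby
and only step in when one of the others fails. Since you run with cephx, the
new daemon will also need its own keyring entry, set up the same way as for
your existing MDSes.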
-Greg

* Active vs. standby MDSes (Was: Re: Degraded PGs blocking open()?)
From: Székelyi Szabolcs @ 2011-06-09 10:23 UTC
  To: ceph-devel

Hey Greg,

On 2011. June 7. 04:15:31 Gregory Farnum wrote:
> 2011/6/6 Székelyi Szabolcs <szekelyi@niif.hu>:
> > I have a three node ceph setup, two nodes playing all three roles (OSD,
> > MDS, MON), and one being just a monitor (which happens to be the client
> > I'm using the filesystem from).
> > 
> > I want to achieve high availability by mirroring all data between the OSDs
> > and being able to still access everything even if one of them goes down.
> > The mirroring works fine, I see the space being consumed on both nodes
> > as I copy data on the file system. According to `ceph -s`, all PGs are
> > in active+clean state. If I start reading a big file and shut down one
> > of the (OSD+MDS+MON) nodes, the file can still be read until the end,
> > that's fine. Moreover, the contents read back seem correct when compared
> > to the original file. Very nice. But if I start reading the file while
> > one of the nodes is down, it blocks until the node comes up again. I
> > can't even kill the reading process with KILL, TERM, or INT.
> > 
> > Am I doing something wrong, was I not careful enough reading the docs,
> > or might this be a bug? My ceph.conf is attached.
> 
> The problem isn't in the OSD, it's the MDS. :)
> 
> The MDS system is *slightly* less resilient than the OSD system is.
> You can set up "standby" MDSes that will take over if the system
> detects that an MDS has died; you can even set up "standby-replay"
> MDSes that follow a specific MDS and keep all its data cached in
> memory so they can take over right when a failure is detected. But if
> you lose one MDS its data won't automatically be imported into the
> remaining MDSes. (Because the MDS keeps all its data on the OSDs,
> there's no danger of losing data -- it's a matter of how the data is
> segregated that requires a new daemon. And generally the process is
> dominated by the timeout, not the time it takes the new MDS to take
> over.)

Thanks for the clarification. I still have a few questions.

If I understand things correctly, Ceph tries to have max_mds number of MDSes 
active at all times. I can have more MDSes than this number, but the excess 
ones will be standby MDSes, right?

I can't really understand the difference between a standby and an active MDS. 
Now I have two active and no standby MDSes, and the filesystem stops working if 
I kill any of them. Does this mean that the system will stop working if it 
can't fill up the number of MDSes to max_mds from the standby pool?

What is the reason for running standby MDSes and not setting max_mds to the 
number of all MDSes?

> So in your case, you're trying to open a file that is controlled by
> the MDS that you killed, and the client can't get the "capability"
> bits that it needs in order to look at the file. So you've got a few
> options:
> 1) Kill the OSD, but not the MDS.

Well, if a machine crashes, then both fall victim. :(

> 2) Create an extra MDS daemon, perhaps on your monitor node. When the
> system detects that one of your MDSes has died (a configurable
> timeout, IIRC in the neighborhood of 30-60 seconds), this extra daemon
> will take over.

I will do this. Should this be a standby or an active MDS? I.e., should I 
increase max_mds from 2 to 3 after creating the new MDS?

> (Or you can just start up the new daemon after you kill the old one,
> doesn't matter.)
> 3) Create a new system with only one MDS and don't kill that one.
> (Eventually you will be able to shrink the number of MDSes, but this
> isn't well-tested or documented so I'm not sure what state it's in
> right now.)

This is not an option since it will create a SPOF and that's exactly the thing 
I'm trying to avoid by using Ceph.

Thanks,
-- 
cc


* Re: Active vs. standby MDSes (Was: Re: Degraded PGs blocking open()?)
From: Gregory Farnum @ 2011-06-09 16:16 UTC
  To: Székelyi Szabolcs; +Cc: ceph-devel

So the Ceph MDS system is a little odd. Presumably you are aware that
file data is stored as chunks (default 4MB) in objects on RADOS (ie,
the OSD system). Metadata is also stored on RADOS, where each
directory is a single object that contains all the inodes underneath
it.
In principle you could construct a filesystem that simply accessed
this on-disk metadata directly for every operation. However, that
would be slow for a number of reasons. To speed up metadata ops, we
have the MetaData Server. Essentially, it does 3 things:
1) Cache the metadata in-memory to speed up read accesses.
2) Handle client locking of data/metadata accesses (this is the
capabilities system)
3) Journal metadata write operations so that we can get streaming
write latencies instead of random lookup-and-write latencies on
changes.

When you have multiple active MDSes, they partition the namespace so
that any given directory has a single "authoritative" MDS which is
responsible for handling requests. This simplifies locking and
prevents the MDSes from duplicating inodes in-cache. You can add MDSes
by increasing the max_mds number. (Most of the machinery is there to
reduce the number of MDSes too, but it's not all tied together.)
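
As a quick sketch (the exact CLI syntax here is from memory and may differ
between versions), bumping the number of active MDSes is a single command
against the monitors:

  # allow three active MDSes; daemons beyond this number stay standby
  ceph mds set_max_mds 3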

So you can see that if one daemon dies you're going to lose access to
the metadata it's responsible for. :( HOWEVER, all the data it's
"responsible" for resides in RADOS so you haven't actually lost any
state or the ability to access it, just the authorization to do so. So
we introduce standby and standby-replay MDSes. If you have more MDS
daemons than max_mds, the first max_mds daemons will start up and
become active and the rest will sit around as standbys. The monitor
service knows they exist and are available, but they don't do anything
until an active MDS dies. If an active MDS does die, the monitor
assigns one of the standbys to take over for that MDS -- it becomes
the same logical MDS but on a different host (or not, if you're
running multiple daemons on the same host or whatever). It goes
through a set of replay stages and then operates normally.
If this makes you shudder in horror from previous experiences with the
HDFS standby or something, fear not. :) We haven't done comprehensive
tests but replay is generally pretty short -- our default timeout on
an MDS is in the region of 30 seconds and I don't think I've ever seen
replay take more than 45.
If you want to reduce the time it takes even further you can assign
certain daemons to be in standby-replay mode. In this mode they
actively replay the journal of an active MDS and maintain the same
cache, so if the active MDS fails they only need to go through the
client reconnect stage.
Also remember that during this time access to metadata which resides
on other non-failed MDSes goes on uninterrupted. :)
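
To make the standby-replay case concrete, a daemon can be pinned to follow a
particular MDS rank in ceph.conf. A rough sketch only -- the mds.cc name and
rank 0 are placeholders, and the option names are from memory, so double-check
the spelling against your version:

[mds.cc]
host = cc
mds standby replay = true
mds standby for rank = 0

Without those two options the daemon simply comes up as an ordinary standby.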

In principle we could do some whiz-bang coding so that if an MDS fails
another active MDS takes over responsibility for its data and then
repartitions it out to the rest of the cluster, but that's not
something we're likely to do for a while given the complexity and the
relatively small advantage over using standby daemons.

So going through your questions specifically:
2011/6/9 Székelyi Szabolcs <szekelyi@niif.hu>:
> If I understand things correctly, Ceph tries to have max_mds number of MDSes
> active at all times. I can have more MDSes than this number, but the excess
> ones will be standby MDSes, right?
Right.

> I can't really understand the difference between a standby and an active MDS.
> Now I have two active and no standby MDSes, and the filesystem stops working if
> I kill any of them. Does this mean that the system will stop working if it
> can't fill up the number of MDSes to max_mds from the standby pool?
Hopefully the above answered this for you a little more precisely, but
the short answer is yes.

> What is the reason for running standby MDSes and not setting max_mds to the
> number of all MDSes?
Again, hopefully you got this above. But it increases system
resiliency in the case of a failure.

>> 2) Create an extra MDS daemon, perhaps on your monitor node. When the
>> system detects that one of your MDSes has died (a configurable
>> timeout, IIRC in the neighborhood of 30-60 seconds), this extra daemon
>> will take over.
>
> I will do this. Should this be a standby or an active MDS? I.e., should I
> increase max_mds from 2 to 3 after creating the new MDS?
Standby -- don't increase max_mds.
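
Once the extra daemon is running, you can sanity-check that the cluster sees
it as a standby (just a sketch, run from any host with an admin keyring):

  ceph mds stat   # should list your two active MDSes plus one up:standby
  ceph -s         # the overall view you're already using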
-Greg

* Re: Active vs. standby MDSes (Was: Re: Degraded PGs blocking open()?)
From: Székelyi Szabolcs @ 2011-06-21 15:04 UTC
  To: ceph-devel

Hi Greg,

I finally had some time to try what you suggested. Adding a standby MDS fixed 
the problem.

Thanks a lot for your detailed clarification and excellent support. So far I 
haven't found any SPOF in the system: killing any single daemon kept the 
filesystem running and all the data available, as long as that was 
theoretically possible, and once the failure was corrected the system fully 
recovered. I did things like stopping and starting daemons while writing to 
or reading from the filesystem, then comparing the data read back with the 
original.

Thanks & keep up the good work,
-- 
cc
