* [Ocfs2-devel] [PATCH 00/11] ocfs2: implement userspace clustering interface
From: Jeff Mahoney @ 2006-01-09 22:39 UTC
  To: ocfs2-devel


Hello all -

As mentioned in my last email, here are my patches for implementing a userspace
clustering interface.

These should be considered early beta, but I very much welcome comment.

A quick preview:
01 - event-driven quorum: o2net will no longer call into quorum
     directly, but rather generate events that quorum will hook into.
     Unfortunately, I've run into a bit of a snag with this, since there
     are two places where recursive events can be generated (i.e. a
     connection event generated while handling a node up/down event),
     and that causes deadlocks on the o2hb_callback_sem. This is really
     the only open issue holding up the entire series.
02 - introduce generic heartbeat resource: initially, this will
     just contain a config_item and will replace the config_item
     in o2hb_region. Eventually, it will be used as a handle for a
     generic heartbeat resource, including several operations.
03 - split disk heartbeat out from the generic heartbeat: They'll still
     be closely tied, but going their separate ways. This patch
     intentionally does very little other than move code around without
     modifying it.
04 - add a heartbeat registration API: This expands the generic
     heartbeat group structure to include the type information as well
     as a few operations necessary to abstract the heartbeat resource.
     In addition, it adds a mechanism for registering a group mode. It
     uses the first mode loaded. Since disk is the only mode at this
     point, there is no way to switch. This will be added later.
05 - add per-resource events: callbacks can specify that they only want
     events from a particular heartbeat resource, and will receive only
     those events. This is useful for sending the file system only the
     events from the heartbeat resource it's listening to.
06 - per-resource membership: fill_node_map can take a resource name
     (UUID) to use for filling the membership bitmap passed in. If NULL
     is passed, it uses a global up/down. No changes to the disk
     heartbeat other than prototype changes are needed, since it still
     keeps a global membership.
07 - o2net refcounted disconnect: Rather than disconnect when a node
     down event is caught by o2net, it waits until the last reference
     is dropped. This is useful for userspace heartbeat since it can
     take down a disk resource but the network resource will still be
     available.
08 - add check_node_status: The userspace heartbeat implementation
     allows the caller to check, on a per-node, per-resource basis,
     whether a particular node is up. Building the global list is more
     expensive, so it is avoided whenever possible.
09 - add /sys/o2cb/heartbeat_mode: This patch allows the user to select
     which mode heartbeat will use. It requires that the change be made
     before the cluster is created.
10 - add userspace clustering: The real goal of all this. This will
     allow the user to create heartbeat directories as before, but
     rather than supplying disk information, it allows the user to
     create symlinks to communicate the current node membership for
     a given heartbeat group. Since configfs doesn't allow dangling
     symlinks, this is an easy way to intuitively configure heartbeat
     resources from userspace. Node UP events are generated when a link
     is created and node DOWN events are generated when a link is
     removed. (A rough shell sketch of driving this follows below.)
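
To make patches 09 and 10 more concrete, here's a rough sketch of how
userspace might drive the interface from a shell. The configfs paths,
names and the heartbeat mode string are assumptions on my part, meant
purely as illustration of the intended flow:

  # Select the userspace heartbeat mode before the cluster is created
  # (patch 09). The mode name "user" is an assumption.
  echo user > /sys/o2cb/heartbeat_mode

  # Assume the cluster and its node tree have already been populated
  # under configfs (e.g. by o2cb); mount point and names are examples.
  ROOT=/sys/kernel/config/cluster/mycluster

  # Create a heartbeat group for a filesystem, named by its UUID
  # (patch 10).
  UUID=0123456789ABCDEF0123456789ABCDEF
  mkdir $ROOT/heartbeat/$UUID

  # Node UP: link a configured node into the group.
  ln -s $ROOT/node/node1 $ROOT/heartbeat/$UUID/node1

  # Node DOWN: remove the link again.
  rm $ROOT/heartbeat/$UUID/node1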

-Jeff

--
Jeff Mahoney
SUSE Labs
 


* [Ocfs2-devel] [PATCH 00/11] ocfs2: implement userspace clustering interface
From: Mark Fasheh @ 2006-01-10  4:29 UTC
  To: ocfs2-devel

Hi Jeff,
	Thanks for sending all these patches out. Once patch 3 is in the
mailman moderation queue, I'll be sure to let it through - last time was my
fault as I accidentally deleted it along with the millions of spam messages
that got caught in there. I'll start with some higher level commentary while
I try to absorb the patchset :) More commentary will come later for sure.

To get the most nit-picky request out of the way, I noticed that many of the
functions you add (including file system functions) don't have a prefix.
It'd be nice if you could keep that consistent with the rest of the code in
their respective files.

On to more important things:

I'm a bit worried about the new methods for querying heartbeat information,
specifically that things are jumping from all heartbeat status being global
(in the sense that it's collated into one giant map) to it being specific to
a given region. Things like the dlm domain joining code have expected it to
be global for some time now. Tcp had a similar assumption which you had to
fix in patch #8. Of course there, it was easy to work around. I need to
think more on this. Things might actually be ok, but it's not something I
expected to change.

Is there any userspace source available that makes use of this yet? Hmm, I
see that you sent a description of what's required from userspace. Perhaps
that'll answer some more questions :)
	--Mark

On Mon, Jan 09, 2006 at 05:39:42PM -0500, Jeff Mahoney wrote:
> [full patch series description snipped]
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com


* [Ocfs2-devel] [PATCH 00/11] ocfs2: implement userspace clustering interface
From: Jeffrey Mahoney @ 2006-01-10  4:51 UTC
  To: ocfs2-devel

Mark Fasheh wrote:
> Hi Jeff,
> 	Thanks for sending all these patches out. Once patch 3 is in the
> mailman moderation queue, I'll be sure to let it through - last time was my
> fault as I accidentally deleted it along with the millions of spam messages
> that got caught in there. I'll start with some higher level commentary while
> I try to absorb the patchset :) More commentary will come later for sure.
> 
> To get the most nit-picky request out of the way, I noticed that many of the
> functions you add (including file system functions) don't have a prefix.
> It'd be nice if you could keep that consistent with the rest of the code in
> their respective files.
> 
> On to more important things:
> 
> I'm a bit worried about the new methods for querying heartbeat information,
> specifically that things are jumping from all heartbeat status being global
> (in the sense that it's collated into one giant map) to it being specific to
> a given region. Things like the dlm domain joining code have expected it to
> be global for some time now. Tcp had a similar assumption which you had to
> fix in patch #8. Of course there, it was easy to work around. I need to
> think more on this. Things might actually be ok, but it's not something I
> expected to change.

OK, well, that's easy enough to back off in the dlm. If the resource
name is NULL, a global request is performed. If the requests lean more
toward the global side than the per-resource side - and they seem to -
perhaps I should introduce a node counter in the user heartbeat code so
that a quick answer is cheaper to come by.

> Is there any userspace source available that makes use of this yet? Hmm, I
> see that you sent a description of what's required from userspace. Perhaps
> that'll answer some more questions :)

Unfortunately, not yet. I've been focusing on the kernel component and
need to work with Lars Marowsky-Bree and Andrew Beekhof on integration
with the hb2 code.

The userspace requirements are kind of steep, but most of that code is
the cluster manager itself and is already written and well tested. The
OCFS2-specific piece just needs to be able to set up / tear down
resources and then add/remove the links for node up/down events. That
part of things shouldn't be too difficult.


-Jeff

--
Jeff Mahoney
SUSE Labs


* [Ocfs2-devel] [PATCH 00/11] ocfs2: implement userspace clustering interface
From: Lars Marowsky-Bree @ 2006-01-10 10:43 UTC
  To: ocfs2-devel

On 2006-01-09T23:51:46, Jeffrey Mahoney <jeffm@suse.com> wrote:

> > Is there any userspace source available that makes use of this yet? Hmm, I
> > see that you sent a description of what's required from userspace. Perhaps
> > that'll answer some more questions :)
> Unfortunately, not yet. I've been focusing on the kernel component and
> need to work with Lars Marowsky-Bree and Andrew Beekhof on integration
> with the hb2 code.

We don't have working user-space code for integrating with the new OCFS2
interface by Jeff yet :-( However, we've been working together to make
sure the interface is "right" for us to use - the good thing about the
new API is that in theory it can be driven from shell scripts for
testing w/no cluster involved at all ;-)
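
For instance, a hand-driven test without any cluster manager could look
roughly like this (a sketch only, assuming the configfs layout Jeff's
patches expose; the device, mount point and names are just examples):

  ROOT=/sys/kernel/config/cluster/mycluster
  UUID=0123456789ABCDEF0123456789ABCDEF
  mkdir $ROOT/heartbeat/$UUID                          # create the group
  ln -s $ROOT/node/node1 $ROOT/heartbeat/$UUID/node1   # node1 UP
  ln -s $ROOT/node/node2 $ROOT/heartbeat/$UUID/node2   # node2 UP
  mount -t ocfs2 /dev/sda1 /srv/www                    # mount against that membership
  rm $ROOT/heartbeat/$UUID/node2                       # simulate node2 going DOWN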

We're a bit caught up in general deadline frenzy right now, but intend
to have working code for driving OCFS2 within the next 2-3 weeks I
guess. As Jeff said, the major parts of the stack are done already, just
the integration piece seems missing...

So, I'd be grateful if you could tell us whether you consider the
direction where this is taking OCFS2 evil, acceptable or wonderful - if
the first, us pursuing that direction would be a waste of time and we'd
need to invent something else, quickly ;-) If merely acceptable, we also
should consider whether we can improve.



Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
	-- Charles Darwin


* [Ocfs2-devel] [PATCH 00/11] ocfs2: implement userspace clustering interface
From: Mark Fasheh @ 2006-01-10 19:08 UTC
  To: ocfs2-devel

On Tue, Jan 10, 2006 at 11:43:19AM +0100, Lars Marowsky-Bree wrote:
> We don't have working user-space code for integrating with the new OCFS2
> interface by Jeff yet :-( However, we've been working together to make
> sure the interface is "right" for us to use - the good thing about the
> new API is that in theory it can be driven from shell scripts for
> testing w/no cluster involved at all ;-)
Ok. Personally I'd like to see the beginnings of that before pushing your
patch into our tree, but I think we're still at the early stages of review
anyway.

> We're a bit caught up in general deadline frenzy right now, but intend
> to have working code for driving OCFS2 within the next 2-3 weeks I
> guess. As Jeff said, the major parts of the stack are done already, just
> the integration piece seems missing...
Heh, I know what that frenzy can be like :)

> So, I'd be grateful if you could tell us whether you consider the
> direction where this is taking OCFS2 evil, acceptable or wonderful - if
> the first, us pursuing that direction would be a waste of time and we'd
> need to invent something else, quickly ;-) If merely acceptable, we also
> should consider whether we can improve.
Oh, I think folks over here are pretty happy with the direction you've been
headed in. Of course there'll always be some things to change, etc. But if
you're worried that anyone here is considering it out and out wrong, that's
certainly not the case.

What I'm trying to do right now is list out all the requirements that OCFS2
and the DLM have of their cluster stack, so I can examine those one by one
against your patch.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com


* [Ocfs2-devel] [PATCH 00/11] ocfs2: implement userspace clustering interface
From: Lars Marowsky-Bree @ 2006-02-01 12:27 UTC
  To: ocfs2-devel

On 2006-01-10T11:08:47, Mark Fasheh <mark.fasheh@oracle.com> wrote:

> > We don't have working user-space code for integrating with the new OCFS2
> > interface by Jeff yet :-( However, we've been working together to make
> > sure the interface is "right" for us to use - the good thing about the
> > new API is that in theory it can be driven from shell scripts for
> > testing w/no cluster involved at all ;-)
> Ok. Personally I'd like to see the beginnings of that before pushing your
> patch into our tree, but I think we're still at the early stages of review
> anyway.

Actually, I'd forgotten to post this here. I'm cross-posting to
linux-ha-dev for comments, too.

I'll once again summarize the approach we're taking for supporting
OCFS2; it breaks down into several steps:

1.) We allow the heartbeat groups to be controlled by, well,
heartbeat/CRM via a "clone" Resource Agent, which then performs
mounts/unmounts. o2cb is still used in this scheme to populate the node
tree. This implements the top-down approach - CRM controls OCFS2 mounts
completely - and is a prerequisite for the next steps. All OCFS2 mounts
have to be configured in the CRM XML configuration, just like other
filesystems.

(This step is coded; I'll go into it after the overview.)

1.5.) We have noticed that we want to restructure some calling
conventions of the Resource Agents for this case (amazing what you
notice when you actually go and _implement_ some grand design! ;-). But
this doesn't affect the general mechanism at all and is just mentioned
for completeness.


2.) "o2cb" is replaced; we auto-discover the IPs of participating nodes
and populate the node tree automatically as required. This removes the
need to configure & sync anything outside hb2 for OCFS2. Well; except
for calling mkfs once, somewhere.

(The filesystem will have to be told which NIC / label to use as part of
the configuration, but that's easy.)
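
To make step 2 concrete: what the replacement has to write is
essentially what o2cb populates today, something along these lines (a
sketch only; I'm assuming the current o2nm node attribute names and an
example cluster name, and the auto-discovery of addresses is of course
the actual new work):

  ROOT=/sys/kernel/config/cluster/mycluster
  mkdir $ROOT                                  # create the cluster itself
  mkdir $ROOT/node/node1                       # declare a node
  echo 1            > $ROOT/node/node1/num
  echo 192.168.1.1  > $ROOT/node/node1/ipv4_address
  echo 7777         > $ROOT/node/node1/ipv4_port
  echo 1            > $ROOT/node/node1/local   # only on node1 itself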

3.) We hook into mount/umount so that when the admin issues such a manual
command for OCFS2, we call into the CRM and either instantiate an OCFS2
mount locally or unmount it (and stop anything on top, if needed, at
least if invoked with -f).

4.) Follows logically from 3.: When the admin tries to mount a
filesystem which we don't know about yet (fsid not in our
configuration), we create the required object from a default template,
and then proceed as above. At this stage, mkfs/mount/umount will provide
the complete look & feel of a regular filesystem.


Further options we might then explore, mostly optimizations:

5.) Right now, OCFS2 is driven via the regular Resource Agent
mechanism. That implies calling (exec) out to an external agent;
ultimately, I'd like to have a "fast" RA interface talking to a
pre-loaded plugin for efficiency. Again, this is mostly an internal
optimization and doesn't really affect the overall design.

6.) Right now, OCFS2's node manager only allows a single cluster. We've
briefly toyed with the thought of having several, which could then use
independent network links, for example for performance. Not sure whether
this is useful at all.


Going back to how step 1 is implemented now.

So, like any cluster manager, we already have the ability to mount/umount
filesystems, of course. I went in and extended this to support our
"clones" (http://linux-ha.org/v2/Concepts/Clones) for use with OCFS2.
Clones are essentially regular resources, but ones that can be
instantiated more than once; and one can tell the system to provide the
clones with notifications when we do something to their instances on
other nodes.

In particular, they get told where else in the cluster their instances
are running. So, what we have to do follows naturally from that, and I'm
quoting the comment in the code at you so I don't have to type it
twice:

	# Process notifications; this is the essential glue level for
	# giving user-space membership events to a cluster-aware
	# filesystem. Right now, only OCFS2 is supported.
	#
	# We get notifications from hb2 that some operation (start or
	# stop) has completed; we then (1) compare the list of nodes
	# which are active in the fs membership with the list of nodes
	# which hb2 wants to be participating and remove those which
	# aren't supposed to be around. And vice-versa, (2) we add nodes
	# which aren't yet members, but which hb2 _does_ want to be
	# active.
	#
	# Eventually, if (3) we figure that we ourselves are on the list
	# of nodes which weren't active yet, we initiate a mount
	# operation.
	#
	# That's it.
	#
	# This approach _does_ have the advantage of being rather
	# robust, I hope. We always re-sync the current membership with
	# the expected membership.
	#
	# Note that this expects that the base cluster is already
	# active; ie o2cb has been started and populated
	# $OCFS2_CLUSTER_ROOT/node/ already. This can be achieved by
	# simply having o2cb run on all nodes by the CRM too.  This
	# probably ought to be mentioned somewhere in the to be written
	# documentation. ;-)
	#

On "stop", we simply umount locally and then remove the heartbeat group
completely. On being notified of a "stop" from another node, the above
logic kicks in and will rempove the node from the heartbeat group.

So, this isn't that difficult, despite being implemented in bash. ;-)

Now, how does this look in the configuration for an 8-node cluster (or
actually, a scenario where you want 8 nodes to be able to mount the fs)?
XML haters beware - and remember that a) this is step 1, b) it is
very close to how filesystems are configured regularly, and c) heartbeat
does have a Python GUI too:

<clone id="exp1" notify="1" notify_confirm="1">
  <instance_attributes>
    <attributes>
      <nvpair name="clone_max" value="8"/>
      <nvpair name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="rsc2" class="ocf" type="Filesystem">
    <operations>
      <op id="dfs-op-1" interval="120s" name="monitor" timeout="60s"/>
      <op id="fs-op-2" interval="120s" name="notify" timeout="60s"/>
    </operations>
    <instance_attributes>
      <attributes>
        <nvpair id="fs-attr-1" name="device" value="/dev/sda1"/>
        <nvpair id="fs-attr-3" name="directory" value="/srv/www"/>
        <nvpair id="fs-attr-4" name="fstype" value="ocfs2"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>

Just the "clone" object surrounding the Filesystem resource is new, plus
the "notify" operation (for which non-cluster filesystems have no use);
the rest is absolutely identical.

I'll test this code some more, and we're currently trying to push out
heartbeat 2.0.3 this week - so I can't go in and commit such a big
change to the Filesystem agent. But this will appear early in 2.0.4,
probably after next week (I'm on "vacation" without network access).

I've attached the diff to the Filesystem agent from my current
workspace. This is meant for illustration only; I've screwed up (i.e.,
deleted) my testbed, so it probably doesn't work because of typos
;-)


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
	-- Charles Darwin

-------------- next part --------------
Index: resources/OCF/Filesystem.in
===================================================================
RCS file: /home/cvs/linux-ha/linux-ha/resources/OCF/Filesystem.in,v
retrieving revision 1.14
diff -u -p -r1.14 Filesystem.in
--- resources/OCF/Filesystem.in	26 Jan 2006 18:00:05 -0000	1.14
+++ resources/OCF/Filesystem.in	1 Feb 2006 12:26:42 -0000
@@ -145,9 +145,31 @@ Any extra options to be given as -o opti
 </parameter>
 </parameters>
 
+<parameter name="ocfs2_cluster" unique="0">
+<longdesc lang="en">
+The name (UUID) of the OCFS2 cluster this filesystem is part of,
+iff this is an OCFS2 resource and there's more than one cluster. You
+should not need to specify this.
+</longdesc>
+<shortdesc lang="en">OCFS2 cluster name/UUID</shortdesc>
+<content type="string" default="" />
+</parameter>
+</parameters>
+
+<parameter name="ocfs2_configfs" unique="0">
+<longdesc lang="en">
+Mountpoint of the cluster hierarchy below configfs. You should not
+need to specify this.
+</longdesc>
+<shortdesc lang="en">OCFS2 configfs root</shortdesc>
+<content type="string" default="" />
+</parameter>
+</parameters>
+
 <actions>
 <action name="start" timeout="60" />
 <action name="stop" timeout="60" />
+<action name="notify" timeout="60" />
 <action name="status" depth="0" timeout="10" interval="10" start-delay="10" />
 <action name="monitor" depth="0" timeout="10" interval="10" start-delay="10" />
 <action name="validate-all" timeout="5" />
@@ -167,16 +189,10 @@ END
 #
 flushbufs() {
   if
-    [ "$BLOCKDEV" != "" -a -x "$BLOCKDEV" ]
+    [ "$BLOCKDEV" != "" -a -x "$BLOCKDEV" -a "$blockdevice" = "yes" ]
   then
-    case $1 in
-      -*|[^/]*:/*|//[^/]*/*)	# -U, -L options to mount, or NFS mount point,
-				# or samba mount point	
-			;;
-      *)		$BLOCKDEV --flushbufs $1
-			return $?
-			;;
-    esac
+    $BLOCKDEV --flushbufs $1
+    return $?
   fi
   
   return 0
@@ -187,6 +203,13 @@ flushbufs() {
 #
 Filesystem_start()
 {
+	if [ "$FSTYPE" = "ocfs2" ] && [ -z "$OCFS2_DO_MOUNT" ]; then
+		# Sorry, start doesn't actually do anything here. Magic
+		# happens in Filesystem_notify; see the comment there.
+		ocf_log debug "$DEVICE: ocfs2 - skipping start."
+		return $OCF_SUCCESS
+	fi		
+
 	# See if the device is already mounted.
 #$MOUNT | cut -d' ' -f3 | grep -e "^$MOUNTPOINT$" >/dev/null
 	Filesystem_status >/dev/null 2>&1
@@ -196,6 +219,8 @@ Filesystem_start()
 	fi
 
 	# Insert SCSI module
+	# TODO: This probably should go away. Why should the filesystem
+	# RA magically load a kernel module?
 	$MODPROBE scsi_hostadapter >/dev/null 2>&1
 
 	if [ -z $FSTYPE ]; then
@@ -222,7 +247,7 @@ Filesystem_start()
 
 	if
 	  case $FSTYPE in
-	    ext3|reiserfs|xfs|jfs|vfat|fat|nfs|cifs|smbfs)	false;;
+	    ext3|reiserfs|reiser4|nss|xfs|jfs|vfat|fat|nfs|cifs|smbfs|ocfs2)	false;;
 	    *)				true;;
 	  esac
         then
@@ -266,11 +291,154 @@ Filesystem_start()
 }
 # end of Filesystem_start
 
+Filesystem_notify() {
+	# Process notifications; this is the essential glue level for
+	# giving user-space membership events to a cluster-aware
+	# filesystem. Right now, only OCFS2 is supported.
+	#
+	# We get notifications from hb2 that some operation (start or
+	# stop) has completed; we then (1) compare the list of nodes
+	# which are active in the fs membership with the list of nodes
+	# which hb2 wants to be participating and remove those which
+	# aren't supposed to be around. And vice-versa, (2) we add nodes
+	# which aren't yet members, but which hb2 _does_ want to be
+	# active.
+	#
+	# Eventually, if (3) we figure that we ourselves are on the list
+	# of nodes which weren't active yet, we initiate a mount
+	# operation.
+	#
+	# That's it.
+	#
+	# If you wonder why we don't process pre-notifications, or don't
+	# do anything in "start": pre-start doesn't help us, because we
+	# don't get it on the node just starting. pre-stop doesn't help
+	# us either, because we can't remove any nodes while still
+	# having the fs mounted. And because we can't mount w/o the
+	# membership populated, we have to wait for the post-start
+	# event.
+	# 
+	# This approach _does_ have the advantage of being rather
+	# robust, I hope. We always re-sync the current membership with
+	# the expected membership.
+	#
+	# Note that this expects that the base cluster is already
+	# active; ie o2cb has been started and populated
+	# $OCFS2_CLUSTER_ROOT/node/ already. This can be achieved by
+	# simply having o2cb run on all nodes by the CRM too.  This
+	# probably ought to be mentioned somewhere in the to be written
+	# documentation. ;-)
+	#
+
+	if [ "$FSTYPE" != "ocfs2" ]; then
+		# One of the cases which shouldn't occur; it should have
+		# been caught much earlier. Still, you know ...
+		ocf_log err "$DEVICE: Notification received for non-ocfs2 mount."
+		return $OCF_ERR_GENERIC
+	fi
+
+	local n_type="$OCF_RESKEY_notify_type"
+	local n_op="$OCF_RESKEY_notify_operation"
+	local n_active="$OCF_RESKEY_notify_active_uname"
+
+	ocf_log debug "$OCFS2_UUID - notify: $n_type for $n_op - active on $n_active"
+
+	if [ "$n_type" != "post" ]; then
+		ocf_log debug "$OCFS2_UUID: ignoring pre-notify."
+		return $OCF_SUCCESS
+	fi
+
+	local n_myself=${HA_CURHOST:-$(uname -n | tr A-Z a-z)}
+	ocf_log debug "$OCFS2_UUID: I am node $n_myself."
+
+	case " $n_active " in
+	*" $n_myself "*) ;;
+	*)	ocf_log err "$OCFS2_UUID: $n_myself (local) not on active list!"
+		return $OCF_ERR_GENERIC
+		;;
+	esac
+
+	# (1)
+	if [ -d "$OCFS2_FS_ROOT" ]; then
+	entry_prefix=$OCFS2_FS_ROOT/
+	for entry in $OCFS2_FS_ROOT/* ; do
+		n_fs="${entry##$entry_prefix}"
+		ocf_log debug "$OCFS2_UUID: Found node $n_fs"
+		case " $n_active " in
+		*" $n_fs "*)
+			# Construct a list of nodes which are present
+			# already in the membership.
+			n_exists="$n_exists $n_fs"
+			ocf_log debug "$OCFS2_UUID: Keeping node: $n_fs"
+			;;
+		*)
+			# Node is in the membership currently, but not on our 
+			# active list. Must be removed.
+			if [ "$n_op" = "start" ]; then
+				ocf_log warn "$OCFS2_UUID: Removing nodes on start"
+			fi
+			ocf_log info "$OCFS2_UUID: Removing dead node: $n_fs"
+			if rm -f $entry ; then
+				ocf_log debug "$OCFS2_UUID: Removal of $n_fs ok."
+			else
+				ocf_log err "$OCFS2_UUID: Removal of $n_fs failed!"
+			fi
+			;;
+		esac
+	done
+	else
+		ocf_log info "$OCFS2_UUID: Doesn't exist yet, creating."
+		mkdir -p $OCFS2_UUID
+	fi
+
+	ocf_log debug "$OCFS2_UUID: Nodes which already exist: $n_exists"
+	
+	# (2)
+	for entry in $n_active ; do
+		ocf_log debug "$OCFS2_UUID: Expected active node: $entry"
+		case " $n_exists " in
+		*" $entry "*)
+			ocf_log debug "$OCFS2_UUID: Already active: $entry"
+			;;
+		*)
+			if [ "$n_op" = "stop" ]; then
+				ocf_log warn "$OCFS2_UUID: Adding nodes on stop"
+			fi
+			ocf_log info "$OCFS2_UUID: Activating node: $entry"
+			if ! ln -s $OCFS2_CLUSTER_ROOT/node/$entry $OCFS2_UUID/$entry ; then
+				ocf_log err "$OCFS2_CLUSTER_ROOT/node/$entry: failed to link"
+				# exit $OCF_ERR_GENERIC
+			fi
+			
+			if [ "$entry" = "$n_myself" ]; then
+				OCFS2_DO_MOUNT=yes
+				ocf_log debug "$OCFS2_UUID: To be mounted."
+			fi	
+			;;
+		esac
+	done
+
+	# (3)
+	# For now, always unconditionally go ahead; we're here, so we
+	# should have the fs mounted. In theory, it should be fine to
+	# only do this when we're activating ourselves, but what if
+	# something went wrong, and we're in the membership but don't
+	# have the fs mounted? Can this happen? TODO
+	OCFS2_DO_MOUNT="yes"
+	if [ -n "$OCFS2_DO_MOUNT" ]; then
+		Filesystem_start
+	fi
+}
+
 #
 # STOP: Unmount the filesystem
 #
 Filesystem_stop()
 {
+	# TODO: We actually need to free up anything mounted on top of
+	# us too, and clear nfs exports of ourselves; otherwise, our own
+	# unmount process may be blocked.
+	
 	# See if the device is currently mounted
 	if
 		Filesystem_status >/dev/null 2>&1
@@ -303,6 +471,7 @@ Filesystem_stop()
 		DEV=`$MOUNT | grep "on $MOUNTPOINT " | cut -d' ' -f1`
 		# Unmount the filesystem
 		$UMOUNT $MOUNTPOINT
+		rc=$?
 	    fi
 		if [ $? -ne 0 ] ; then
 			ocf_log err "Couldn't unmount $MOUNTPOINT"
@@ -313,7 +482,18 @@ Filesystem_stop()
 		: $MOUNTPOINT Not mounted.  No problema!
 	fi
 
-	return $?
+	# We'll never see the post-stop notification. We're gone now,
+	# have unmounted, and thus should remove the membership.
+	if [ "$FSTYPE" = "ocfs2" ]; then
+		if [ ! -d "$OCFS2_FS_ROOT" ]; then
+			ocf_log info "$OCFS2_FS_ROOT: Filesystem membership already gone."
+		else
+			ocf_log info "$OCFS2_FS_ROOT: Removing membership directory."
+			rm -rf $OCFS2_FS_ROOT/
+		fi
+	fi
+	
+	return $rc
 }
 # end of Filesystem_stop
 
@@ -339,6 +519,10 @@ Filesystem_status()
           msg="$MOUNTPOINT is unmounted (stopped)"
         fi
 
+	# TODO: For ocfs2, or other cluster filesystems, should we be
+	# checking connectivity to other nodes here, or the IO path to
+	# the storage?
+	
         case "$OP" in
 	  status)	ocf_log info "$msg";;
 	esac
@@ -383,6 +567,63 @@ Filesystem_validate_all()
 	return $OCF_SUCCESS
 }
 
+ocfs2_init()
+{
+	# Check & initialize the OCFS2 specific variables.
+	if [ -z "$OCF_RESKEY_clone_max" ]; then
+		ocf_log err "ocfs2 must be run as a clone."
+		exit $OCF_ERR_GENERIC
+	fi
+
+	if [ $blockdevice = "no" ]; then
+		ocf_log err "$DEVICE: ocfs2 needs a block device instead."
+		exit $OCF_ERR_GENERIC
+	fi
+	
+	for f in "$OCF_RESKEY_ocfs2_configfs" /sys/kernel/config/cluster /configfs/cluster ; do
+		if [ -n "$f" -a -d "$f" ]; then
+			OCFS2_CONFIGFS="$f"
+			ocf_log debug "$OCFS2_CONFIGFS: used as configfs root."
+			break
+		fi
+	done
+	if [ ! -d "$OCFS2_CONFIGFS" ]; then
+		ocf_log err "ocfs2 needs configfs mounted."
+		exit $OCF_ERR_GENERIC
+	fi
+
+	OCFS2_UUID=$(mounted.ocfs2 -d $DEVICE|tail -1|awk '{print $3}'|tr -d -- -|tr a-z A-Z)
+	if [ -z "$OCFS2_UUID" ]; then
+		ocf_log err "$DEVICE: Could not determine ocfs2 UUID."
+		exit $OCF_ERR_GENERIC
+	fi
+	
+	if [ -n "$OCF_RESKEY_ocfs2_cluster" ]; then
+		OCFS2_CLUSTER=$(echo $OCF_RESKEY_ocfs2_cluster | tr a-z A-Z)
+	else
+		OCFS2_CLUSTER=$(find /tmp -maxdepth 1 -mindepth 1 -type d 2>&1)
+		set -- $OCFS2_CLUSTER
+		local n="$#"
+		if [ $n -gt 1 ]; then
+			ocf_log err "$OCFS2_CLUSTER: several clusters found."
+			exit $OCF_ERR_GENERIC
+		fi
+		if [ $n -eq 0 ]; then
+			ocf_log err "$OCFS2_CONFIGFS: no clusters found."
+			exit $OCF_ERR_GENERIC
+		fi
+	fi
+	ocf_log debug "$DEVICE: using cluster $OCFS2_CLUSTER"
+
+	OCFS2_CLUSTER_ROOT="$OCFS2_CONFIGFS/$OCFS2_CLUSTER"
+	if [ ! -d "$OCFS2_CLUSTER_ROOT" ]; then
+		ocf_log err "$OCFS2_CLUSTER: Cluster doesn't exist. Maybe o2cb hasn't been run?"
+		exit $OCF_ERR_GENERIC
+	fi
+	
+	OCFS2_FS_ROOT=$OCFS2_CLUSTER_ROOT/heartbeat/$OCFS2_UUID
+}
+
 # Check the arguments passed to this script
 if
   [ $# -ne 1 ]
@@ -428,6 +669,17 @@ case $DEVICE in
 	;;
 esac
 
+if [ "$FSTYPE" = "ocfs2" ]; then
+	ocfs2_init
+else 
+	if [ -n "$OCF_RESKEY_clone_max" ]; then
+		ocf_log err "DANGER! $FSTYPE on $DEVICE is NOT cluster-aware!"
+		ocf_log err "DO NOT RUN IT AS A CLONE!"
+		ocf_log err "Politely refusing to proceed to avoid data corruption."
+		exit $OCF_ERR_GENERIC	
+	fi
+fi
+
 # It is possible that OCF_RESKEY_directory has one or even multiple trailing "/".
 # But the output of `mount` and /proc/mounts do not.
 if [ -z $OCF_RESKEY_directory ]; then
@@ -439,6 +691,8 @@ else
     MOUNTPOINT=$(echo $OCF_RESKEY_directory | sed 's/\/*$//')
     : ${MOUNTPOINT:=/}
     # At this stage, $MOUNTPOINT does not contain trailing "/" unless it is "/"
+    # TODO: / mounted via Filesystem sounds dangerous. On stop, we'll
+    # kill the whole system. Is that a good idea?
 fi
 	
 # Check to make sure the utilites are found
@@ -451,6 +705,8 @@ check_util $UMOUNT
 case $OP in
   start)		Filesystem_start
 			;;
+  notify)		Filesystem_notify
+			;;
   stop)			Filesystem_stop
 			;;
   status|monitor)	Filesystem_status


* [Ocfs2-devel] [Linux-ha-dev] Re: [PATCH 00/11] ocfs2: implement userspace clustering interface
From: Lars Marowsky-Bree @ 2006-02-01 15:29 UTC
  To: ocfs2-devel

On 2006-02-01T08:23:09, Alan Robertson <alanr@unix.sh> wrote:

> Except that you CANNOT mount, umount, or mkfs before the CRM starts. 
> This means you can't put it in fstab like people conventionally do. 
> (Unless of course, the CRM somehow gets started really early - this 
> would likely be messy)

Well of course. That is quite true. You can't access a cluster
filesystem before the cluster stack is up. 

And just like other Filesystem instances or any other resource under our
control, we require that it not be started before us; i.e., not mounted
automatically on boot.

But, if the admin wishes, this could be implemented similarly to how ocfs2
already does it - namely, it already has to delay mounting until after
the network is up (like NFS), and would thus delay until hb is up.

Thanks for the clarification.

This is not much of an issue unless we aim for "shared root" on a
cluster filesystem, in which case we'd need to get fancy with
initrd/initramfs and initialize (maybe in a low-cost, read-only mode)
access to the root fs. This is something I'm not that interested in
right now, because of the pain it implies in various places; it would
require changes to the whole distribution.

And sorry for running off on this tangent ;-) Just important to keep at
the back of the mind, even if not relevant yet.

> Lars already knows this - but for the rest of you:  This would be 
> relatively easily implemented in our current architecture.  We already 
> have a special class of resource agent which does this (STONITH). 
> Adding a "general" one would be relatively easy.  Just need to make sure 
> we design the API in an extensible way so we don't have a bunch of churn 
> later on.

Right, there's even an open bugzilla to track this feature already.


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
	-- Charles Darwin

