ocfs2-devel.lists.linux.dev archive mirror
* [Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support
@ 2022-07-30  1:14 Heming Zhao via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown Heming Zhao via Ocfs2-devel
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-07-30  1:14 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

This patch series re-enables the ocfs2 non-clustered mount feature.

The previous commit c80af0c250c8 (Revert "ocfs2: mount shared volume
without ha stack") reverted Gang's non-clustered mount patch. This
series re-enables ocfs2 non-clustered mount.

The key difference between local mount and non-clustered mount:
the local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't
do the conversion without a ha stack, while the non-clustered mount
feature can run entirely without a ha stack.

To avoid data corruption when non-clustered and clustered mounts
happen at the same time, this series also introduces the MMP
feature. The MMP (Multiple Mount Protection) idea comes from ext4 MMP
(fs/ext4/mmp.c), which protects a fs from being mounted more than
once. Since ocfs2 is a clustered fs, and for compatibility with the
existing slotmap feature, I made some optimizations and modifications
when porting from ext4 MMP to ocfs2.

The related userspace code supporting MMP has been sent to GitHub
for review:
- https://github.com/markfasheh/ocfs2-tools/pull/58

Enabling MMP and checking its status with ocfs2-tools:

```
# enable MMP
tunefs.ocfs2 --fs-features=mmp /dev/vdb

# check the command result
tunefs.ocfs2 -Q "%H\n" /dev/vdb | grep MMP

# activate MMP via a non-clustered mount
mount -t ocfs2 -o nocluster /dev/vdb /mnt

# check slotmap info
echo slotmap | PAGER=cat debugfs.ocfs2 /dev/vdb
```

==== below are the test cases for this series ====

<1> non-clustered mount vs local mount

1.1 tunefs.ocfs2 can't convert the [no]local feature without a ha stack.

```
(on ha stack env)
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=local /dev/vdb  (<== success)
tunefs.ocfs2 --fs-features=nolocal /dev/vdb  (<== success)
(on another node without ha stack)
tunefs.ocfs2 --fs-features=local /dev/vdb  (<== failure)
```

1.2 The non-clustered mount feature can run without a ha stack.
```
(on ha stack env)
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
(on another node without ha stack)
mount -t ocfs2 -o nocluster /dev/vdb /mnt  (<== success)
```


<2> do clustered & non-clustered mount on same node

2.1  non-clustered mount => clustered mount

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
mount -t ocfs2 /dev/vdb /mnt               (<=== failure)
```

2.2 clustered mount => non-clustered mount

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt  (<=== failure)
```

<3> one node does clustered mount, another does non-clustered mount

Test rule: a clustered mount and a non-clustered mount cannot exist at
the same time.

3.1 clustered mount @node1 => [no]clustered mount @node2

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== success)
umount /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure)
```

3.2 enable mmp, repeat case 3.1

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb   (<== enable mmp)
mount -t ocfs2 /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== waits ~22s [*] for mmp, then succeeds)
umount /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure)
```

[*] 22s:
(OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) seconds * 2 rounds of calling
schedule_timeout_interruptible.

3.3 non-clustered mount @node1 => [no]clustered mount @node2

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== failure)
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success, without mmp enabled)
umount /mnt  (<== will ZERO out the slotmap area while node1 is still mounted)
```

3.4 enable mmp, repeat case 3.3.

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb   (<== enable mmp)
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== failure)
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure, denied by mmp)
```

<4> simulate mounting after machine crash

Info:
- all steps below are done on one node.
- address 287387648 is the '//slot_map' extent address.
- tests the rule: if the last mount didn't unmount (e.g. crash), the
  next mount MUST be of the same mount type.

4.0 how to calculate the '//slot_map' extent address

```
# PAGER=cat debugfs.ocfs2 -R "stats" /dev/vdb | grep "Block Size Bits"
        Block Size Bits: 12   Cluster Size Bits: 12

# PAGER=cat debugfs.ocfs2 -R "stat //slot_map" /dev/vdb | grep -A1 "Block#"
        ## Offset        Clusters       Block#          Flags
        0  0             1              70163           0x0
```

70163 * (1<<12) = 70163 * 4096 = 287387648
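The same arithmetic works for any metadata file: the byte offset is the extent's starting block number shifted left by the block-size bits. A small sketch (the 70163 and 12 values are the ones from the sample debugfs.ocfs2 output above):

```python
def extent_byte_offset(block_num: int, block_size_bits: int) -> int:
    """Byte offset on disk of a block, given the fs block-size bits."""
    return block_num << block_size_bits

# Values from the debugfs.ocfs2 output above: block 70163, 4K blocks.
print(extent_byte_offset(70163, 12))  # 287387648
```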


4.1 clustered mount => crash => non-clustered mount fails => clean
slotmap => non-clustered mount succeeds

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt
dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.cluster.mnted  (<== backup slot info)
umount /mnt
dd if=/root/slotmap.cluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32  (<== overwrite)

mount -t ocfs2 -o nocluster /dev/vdb /mnt   <== failure
mount -t ocfs2 /dev/vdb /mnt && umount /mnt <== clean slot 0
mount -t ocfs2 -o nocluster /dev/vdb /mnt   <== success
```

4.2  non-clustered mount => crash => clustered mount fails => clean
slotmap => clustered mount succeeds

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.nocluster.mnted
umount /mnt
dd if=/root/slotmap.nocluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32

mount -t ocfs2 /dev/vdb /mnt   <== failure
mount -t ocfs2 -o nocluster /dev/vdb /mnt && umount /mnt <== clean slot 0
mount -t ocfs2 /dev/vdb /mnt   <== success
```

<5> MMP test

5.1 node1 non-clustered mount => node2 non-clustered mount

disable mmp
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success)
```

enable mmp
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== waits ~12s [*], fails due to mmp)
```

[*] 12s:
sleeps (OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) seconds, then detects
that mmp_seq has changed, then fails.
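This also matches the constants in patch 4/4: a single sleep cycle of 11 seconds, plus roughly a second of slotmap I/O, gives the observed ~12s (assuming OCFS2_MMP_MIN_CHECK_INTERVAL = 5 as in the patch):

```python
OCFS2_MMP_MIN_CHECK_INTERVAL = 5  # seconds, from patch 4/4

# One sleep cycle; the second read then sees a changed mmp_seq and the
# mount is refused. The observed ~12s includes the extra I/O time.
sleep_seconds = OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1
print(sleep_seconds)  # 11
```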

5.2 node1 clustered mount => node2 clustered mount

see case 3.2

5.3 node1 non-clustered mount => node2 non-clustered mount (mmp enabled)

see case 3.4

5.4 remount test

5.4.1 non-clustered mount (run commands on same node)

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb

mount -t ocfs2 -o nocluster /dev/vdb /mnt
ps axj | grep kmmpd                            (<== will show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== shows 'OCFS2_MMP_SEQ')

mount -o remount,ro,nocluster /dev/vdb /mnt    (<== kmmpd will stop)
ps axj | grep kmmpd  (<== won't show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== shows 'OCFS2_MMP_SEQ_CLEAN')

mount -o remount,rw,nocluster /dev/vdb /mnt    (<== kmmpd will start)
ps axj | grep kmmpd  (<== will show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== shows 'OCFS2_MMP_SEQ')
```

5.4.2 clustered mount

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb

mount -t ocfs2 /dev/vdb /mnt                   (<== a clustered mount won't create kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== shows 'OCFS2_VALID_CLUSTER')

mount -o remount,ro /dev/vdb /mnt
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== shows 'OCFS2_VALID_CLUSTER')

mount -o remount,rw /dev/vdb /mnt              (<== waits ~22s for mmp to start)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== shows 'OCFS2_VALID_CLUSTER')
```

Heming Zhao (4):
  ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
  ocfs2: add mlog ML_WARNING support
  re-enable "ocfs2: mount shared volume without ha stack"
  ocfs2: introduce ext4 MMP feature

 fs/ocfs2/cluster/masklog.c |   3 +
 fs/ocfs2/cluster/masklog.h |   9 +-
 fs/ocfs2/dlmglue.c         |   3 +
 fs/ocfs2/ocfs2.h           |   6 +-
 fs/ocfs2/ocfs2_fs.h        |  13 +-
 fs/ocfs2/slot_map.c        | 479 +++++++++++++++++++++++++++++++++++--
 fs/ocfs2/slot_map.h        |   3 +
 fs/ocfs2/super.c           |  42 +++-
 8 files changed, 527 insertions(+), 31 deletions(-)

-- 
2.37.1


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


* [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
  2022-07-30  1:14 [Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support Heming Zhao via Ocfs2-devel
@ 2022-07-30  1:14 ` Heming Zhao via Ocfs2-devel
  2022-08-08  6:51   ` Joseph Qi via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: add mlog ML_WARNING support Heming Zhao via Ocfs2-devel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-07-30  1:14 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

In local mount mode, no dlm resources are initialized. If
ocfs2_mount_volume() fails in ocfs2_find_slot(), the error handling
flow calls ocfs2_dlm_shutdown(), which then does the dlm resource
cleanup and triggers a kernel crash.

Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
 fs/ocfs2/dlmglue.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 801e60bab955..1438ac14940b 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
 void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
 			int hangup_pending)
 {
+	if (ocfs2_mount_local(osb))
+		return;
+
 	ocfs2_drop_osb_locks(osb);
 
 	/*
-- 
2.37.1



* [Ocfs2-devel] [PATCH 2/4] ocfs2: add mlog ML_WARNING support
  2022-07-30  1:14 [Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support Heming Zhao via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown Heming Zhao via Ocfs2-devel
@ 2022-07-30  1:14 ` Heming Zhao via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack" Heming Zhao via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature Heming Zhao via Ocfs2-devel
  3 siblings, 0 replies; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-07-30  1:14 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

This commit adds a new mlog message type, ML_WARNING, to ocfs2.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
 fs/ocfs2/cluster/masklog.c | 3 +++
 fs/ocfs2/cluster/masklog.h | 9 +++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/cluster/masklog.c b/fs/ocfs2/cluster/masklog.c
index 563881ddbf00..bac3488e8002 100644
--- a/fs/ocfs2/cluster/masklog.c
+++ b/fs/ocfs2/cluster/masklog.c
@@ -63,6 +63,9 @@ void __mlog_printk(const u64 *mask, const char *func, int line,
 	if (*mask & ML_ERROR) {
 		level = KERN_ERR;
 		prefix = "ERROR: ";
+	} else if (*mask & ML_WARNING) {
+		level = KERN_WARNING;
+		prefix = "WARNING: ";
 	} else if (*mask & ML_NOTICE) {
 		level = KERN_NOTICE;
 	} else {
diff --git a/fs/ocfs2/cluster/masklog.h b/fs/ocfs2/cluster/masklog.h
index b73fc42e46ff..d0bc4fe8cf3d 100644
--- a/fs/ocfs2/cluster/masklog.h
+++ b/fs/ocfs2/cluster/masklog.h
@@ -86,10 +86,11 @@
 
 /* bits that are infrequently given and frequently matched in the high word */
 #define ML_ERROR	0x1000000000000000ULL /* sent to KERN_ERR */
-#define ML_NOTICE	0x2000000000000000ULL /* setn to KERN_NOTICE */
-#define ML_KTHREAD	0x4000000000000000ULL /* kernel thread activity */
+#define ML_NOTICE	0x2000000000000000ULL /* sent to KERN_NOTICE */
+#define ML_WARNING	0x4000000000000000ULL /* sent to KERN_WARNING */
+#define ML_KTHREAD	0x8000000000000000ULL /* kernel thread activity */
 
-#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_NOTICE)
+#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_WARNING|ML_NOTICE)
 #ifndef MLOG_MASK_PREFIX
 #define MLOG_MASK_PREFIX 0
 #endif
@@ -102,7 +103,7 @@
 #if defined(CONFIG_OCFS2_DEBUG_MASKLOG)
 #define ML_ALLOWED_BITS ~0
 #else
-#define ML_ALLOWED_BITS (ML_ERROR|ML_NOTICE)
+#define ML_ALLOWED_BITS (ML_ERROR|ML_WARNING|ML_NOTICE)
 #endif
 
 #define MLOG_MAX_BITS 64
-- 
2.37.1



* [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-07-30  1:14 [Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support Heming Zhao via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown Heming Zhao via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: add mlog ML_WARNING support Heming Zhao via Ocfs2-devel
@ 2022-07-30  1:14 ` Heming Zhao via Ocfs2-devel
  2022-07-31 17:42   ` Mark Fasheh via Ocfs2-devel
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature Heming Zhao via Ocfs2-devel
  3 siblings, 1 reply; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-07-30  1:14 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

The key difference between local mount and non-clustered mount:
the local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't
do the conversion without a ha stack, while the non-clustered mount
feature can run entirely without a ha stack.

Commit 912f655d78c5 ("ocfs2: mount shared volume without ha stack") had
a bug, so commit c80af0c250c8 reverted it.

Let me explain the issue mentioned by commit c80af0c250c8.

In Junxiao's call trace, the 'if' check in __ocfs2_find_empty_slot() is
wrong: sl_node_num can be 0 in an o2cb environment.
With the current information, the trigger flow (based on 912f655d78c5):
1>
nodeA mounts with 'node_num = 0'. It succeeds.
At this point, the slotmap extent block contains es_valid:1 &
es_node_num:0 for nodeA,
and ocfs2_update_disk_slot() writes the slotmap info back to disk.

2>
Then nodeB mounts with 'node_num = 1'.
This time, osb->node_num is 1 (set by the config file) and
osb->preferred_slot is OCFS2_INVALID_SLOT (set by ocfs2_parse_options).

ocfs2_find_slot
 + ocfs2_update_slot_info //read slotmap info from disk
 |  + set si->si_slots[0].es_valid = 1 & si->si_slots[0].sl_node_num = 0
 |
 + __ocfs2_node_num_to_slot //will return -ENOENT.
 + __ocfs2_find_empty_slot
    + if ((preferred >= 0) && (preferred < si->si_num_slots))
    |  fails to enter this 'if' because preferred is OCFS2_INVALID_SLOT
    |
    + 'for(i = 0; i < si->si_num_slots; i++)' finds slot 0
    |  successfully: 'si->si_slots[0].sl_node_num' is 0 (false), which
    |  triggers the 'break' condition.
    |
    + returns slot 0,
       which makes nodeB grab nodeA's journal dlm lock and then hang.

How do we fix this bug?

This commit re-enables 912f655d78c5; the next commit (adding MMP
support) will fix the issue.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
 fs/ocfs2/ocfs2.h    |  4 +++-
 fs/ocfs2/slot_map.c | 46 ++++++++++++++++++++++++++-------------------
 fs/ocfs2/super.c    | 21 +++++++++++++++++++++
 3 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 740b64238312..337527571461 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -277,6 +277,7 @@ enum ocfs2_mount_options
 	OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15,  /* Journal Async Commit */
 	OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */
 	OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */
+	OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */
 };
 
 #define OCFS2_OSB_SOFT_RO	0x0001
@@ -672,7 +673,8 @@ static inline int ocfs2_cluster_o2cb_global_heartbeat(struct ocfs2_super *osb)
 
 static inline int ocfs2_mount_local(struct ocfs2_super *osb)
 {
-	return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT);
+	return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)
+		|| (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER));
 }
 
 static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index da7718cef735..0b0ae3ebb0cf 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -252,14 +252,16 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	int i, ret = -ENOSPC;
 
 	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
-		if (!si->si_slots[preferred].sl_valid) {
+		if (!si->si_slots[preferred].sl_valid ||
+		    !si->si_slots[preferred].sl_node_num) {
 			ret = preferred;
 			goto out;
 		}
 	}
 
 	for(i = 0; i < si->si_num_slots; i++) {
-		if (!si->si_slots[i].sl_valid) {
+		if (!si->si_slots[i].sl_valid ||
+		    !si->si_slots[i].sl_node_num) {
 			ret = i;
 			break;
 		}
@@ -454,24 +456,30 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
 	spin_lock(&osb->osb_lock);
 	ocfs2_update_slot_info(si);
 
-	/* search for ourselves first and take the slot if it already
-	 * exists. Perhaps we need to mark this in a variable for our
-	 * own journal recovery? Possibly not, though we certainly
-	 * need to warn to the user */
-	slot = __ocfs2_node_num_to_slot(si, osb->node_num);
-	if (slot < 0) {
-		/* if no slot yet, then just take 1st available
-		 * one. */
-		slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+	if (ocfs2_mount_local(osb))
+		/* use slot 0 directly in local mode */
+		slot = 0;
+	else {
+		/* search for ourselves first and take the slot if it already
+		 * exists. Perhaps we need to mark this in a variable for our
+		 * own journal recovery? Possibly not, though we certainly
+		 * need to warn to the user */
+		slot = __ocfs2_node_num_to_slot(si, osb->node_num);
 		if (slot < 0) {
-			spin_unlock(&osb->osb_lock);
-			mlog(ML_ERROR, "no free slots available!\n");
-			status = -EINVAL;
-			goto bail;
-		}
-	} else
-		printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already "
-		       "allocated to this node!\n", slot, osb->dev_str);
+			/* if no slot yet, then just take 1st available
+			 * one. */
+			slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+			if (slot < 0) {
+				spin_unlock(&osb->osb_lock);
+				mlog(ML_ERROR, "no free slots available!\n");
+				status = -EINVAL;
+				goto bail;
+			}
+		} else
+			printk(KERN_INFO "ocfs2: Slot %d on device (%s) was "
+			       "already allocated to this node!\n",
+			       slot, osb->dev_str);
+	}
 
 	ocfs2_set_slot(si, slot, osb->node_num);
 	osb->slot_num = slot;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 438be028935d..f7298816d8d9 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -172,6 +172,7 @@ enum {
 	Opt_dir_resv_level,
 	Opt_journal_async_commit,
 	Opt_err_cont,
+	Opt_nocluster,
 	Opt_err,
 };
 
@@ -205,6 +206,7 @@ static const match_table_t tokens = {
 	{Opt_dir_resv_level, "dir_resv_level=%u"},
 	{Opt_journal_async_commit, "journal_async_commit"},
 	{Opt_err_cont, "errors=continue"},
+	{Opt_nocluster, "nocluster"},
 	{Opt_err, NULL}
 };
 
@@ -616,6 +618,13 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 		goto out;
 	}
 
+	tmp = OCFS2_MOUNT_NOCLUSTER;
+	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
+		ret = -EINVAL;
+		mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
+		goto out;
+	}
+
 	tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL |
 		OCFS2_MOUNT_HB_NONE;
 	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
@@ -856,6 +865,7 @@ static int ocfs2_verify_userspace_stack(struct ocfs2_super *osb,
 	}
 
 	if (ocfs2_userspace_stack(osb) &&
+	    !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
 	    strncmp(osb->osb_cluster_stack, mopt->cluster_stack,
 		    OCFS2_STACK_LABEL_LEN)) {
 		mlog(ML_ERROR,
@@ -1127,6 +1137,11 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
 	       osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? "writeback" :
 	       "ordered");
 
+	if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
+	   !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT))
+		printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted "
+		       "without cluster aware mode.\n", osb->dev_str);
+
 	atomic_set(&osb->vol_state, VOLUME_MOUNTED);
 	wake_up(&osb->osb_mount_event);
 
@@ -1437,6 +1452,9 @@ static int ocfs2_parse_options(struct super_block *sb,
 		case Opt_journal_async_commit:
 			mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT;
 			break;
+		case Opt_nocluster:
+			mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER;
+			break;
 		default:
 			mlog(ML_ERROR,
 			     "Unrecognized mount option \"%s\" "
@@ -1548,6 +1566,9 @@ static int ocfs2_show_options(struct seq_file *s, struct dentry *root)
 	if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT)
 		seq_printf(s, ",journal_async_commit");
 
+	if (opts & OCFS2_MOUNT_NOCLUSTER)
+		seq_printf(s, ",nocluster");
+
 	return 0;
 }
 
-- 
2.37.1



* [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
  2022-07-30  1:14 [Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support Heming Zhao via Ocfs2-devel
                   ` (2 preceding siblings ...)
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack" Heming Zhao via Ocfs2-devel
@ 2022-07-30  1:14 ` Heming Zhao via Ocfs2-devel
  2022-07-31  9:13   ` heming.zhao--- via Ocfs2-devel
  2022-08-08  8:19   ` Joseph Qi via Ocfs2-devel
  3 siblings, 2 replies; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-07-30  1:14 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

MMP (multiple mount protection) gives the filesystem the ability to
prevent itself from being mounted multiple times.

To avoid data corruption when non-clustered and/or clustered mounts
happen at the same time, this commit introduces the MMP feature. The
MMP idea comes from the ext4 MMP code (fs/ext4/mmp.c). Since ocfs2 is
a clustered fs, and for compatibility with the existing slotmap
feature, I made some optimizations and modifications when porting from
ext4 to ocfs2.

One optimization:
mmp has a kthread, kmmpd-<dev>, which is only created in non-clustered
mode.

We set a rule:
if the last mount didn't unmount (e.g. crash), the next mount MUST be
of the same mount type.

Finally, this commit also fixes the issue mentioned by commit
c80af0c250c8 ("Revert "ocfs2: mount shared volume without ha stack").
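The rule can be illustrated with the sentinel values this patch writes into the slot map (the constants are from the patch itself; the helper function is a simplified illustration for review purposes, not the kernel's actual decision code):

```python
# Sentinel values defined in fs/ocfs2/slot_map.c by this patch.
OCFS2_MMP_SEQ_CLEAN   = 0xFF4D4D50  # clean unmount
OCFS2_VALID_CLUSTER   = 0xE24D4D55  # clustered mount, MMP disabled
OCFS2_VALID_NOCLUSTER = 0xE24D4D5A  # non-clustered mount, MMP disabled

def next_mount_allowed(slot_valid: int, want_clustered: bool) -> bool:
    """Simplified model: a slot left non-clean by a crash only accepts
    the same mount type that left it behind."""
    if slot_valid in (0, OCFS2_MMP_SEQ_CLEAN):
        return True                   # clean slot: any type may mount
    if slot_valid == OCFS2_VALID_CLUSTER:
        return want_clustered         # crashed clustered mount
    if slot_valid == OCFS2_VALID_NOCLUSTER:
        return not want_clustered     # crashed non-clustered mount
    # A live/stale mmp_seq implies a non-clustered MMP-enabled owner,
    # since kmmpd only runs in non-clustered mode.
    return not want_clustered

# Matches test case 4.1: after a clustered-mount crash, a non-clustered
# mount must fail and a clustered mount must succeed.
print(next_mount_allowed(OCFS2_VALID_CLUSTER, want_clustered=False))  # False
print(next_mount_allowed(OCFS2_VALID_CLUSTER, want_clustered=True))   # True
```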

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
 fs/ocfs2/ocfs2.h    |   2 +
 fs/ocfs2/ocfs2_fs.h |  13 +-
 fs/ocfs2/slot_map.c | 459 ++++++++++++++++++++++++++++++++++++++++++--
 fs/ocfs2/slot_map.h |   3 +
 fs/ocfs2/super.c    |  23 ++-
 5 files changed, 479 insertions(+), 21 deletions(-)

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 337527571461..37a7c5855d07 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -337,6 +337,8 @@ struct ocfs2_super
 	unsigned int node_num;
 	int slot_num;
 	int preferred_slot;
+	u16 mmp_update_interval;
+	struct task_struct *mmp_task;
 	int s_sectsize_bits;
 	int s_clustersize;
 	int s_clustersize_bits;
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 638d875eccc7..015672f75563 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -87,7 +87,8 @@
 					 | OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE \
 					 | OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG	\
 					 | OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \
-					 | OCFS2_FEATURE_INCOMPAT_APPEND_DIO)
+					 | OCFS2_FEATURE_INCOMPAT_APPEND_DIO \
+					 | OCFS2_FEATURE_INCOMPAT_MMP)
 #define OCFS2_FEATURE_RO_COMPAT_SUPP	(OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \
 					 | OCFS2_FEATURE_RO_COMPAT_USRQUOTA \
 					 | OCFS2_FEATURE_RO_COMPAT_GRPQUOTA)
@@ -167,6 +168,11 @@
  */
 #define OCFS2_FEATURE_INCOMPAT_APPEND_DIO	0x8000
 
+/*
+ * Multiple mount protection
+ */
+#define OCFS2_FEATURE_INCOMPAT_MMP	0x10000
+
 /*
  * backup superblock flag is used to indicate that this volume
  * has backup superblocks.
@@ -535,8 +541,7 @@ struct ocfs2_slot_map {
 };
 
 struct ocfs2_extended_slot {
-/*00*/	__u8	es_valid;
-	__u8	es_reserved1[3];
+/*00*/	__le32	es_valid;
 	__le32	es_node_num;
 /*08*/
 };
@@ -611,7 +616,7 @@ struct ocfs2_super_block {
 						     INCOMPAT flag set. */
 /*B8*/	__le16 s_xattr_inline_size;	/* extended attribute inline size
 					   for this fs*/
-	__le16 s_reserved0;
+	__le16 s_mmp_update_interval; /* # seconds to wait in MMP checking */
 	__le32 s_dx_seed[3];		/* seed[0-2] for dx dir hash.
 					 * s_uuid_hash serves as seed[3]. */
 /*C0*/  __le64 s_reserved2[15];		/* Fill out superblock */
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index 0b0ae3ebb0cf..86a21140ead6 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -8,6 +8,8 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/highmem.h>
+#include <linux/random.h>
+#include <linux/kthread.h>
 
 #include <cluster/masklog.h>
 
@@ -24,9 +26,48 @@
 
 #include "buffer_head_io.h"
 
+/*
+ * This structure will be used for multiple mount protection. It will be
+ * written into the '//slot_map' field in the system dir.
+ * Programs that check MMP should assume that if SEQ_FSCK (or any unknown
+ * code above SEQ_MAX) is present then it is NOT safe to use the filesystem.
+ */
+#define OCFS2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define OCFS2_MMP_SEQ_FSCK  0xE24D4D50U /* mmp_seq value when being fscked */
+#define OCFS2_MMP_SEQ_MAX   0xE24D4D4FU /* maximum valid mmp_seq value */
+#define OCFS2_MMP_SEQ_INIT  0x0         /* mmp_seq init value */
+#define OCFS2_VALID_CLUSTER   0xE24D4D55U /* value for clustered mount
+                                             under MMP disabled */
+#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount
+                                             under MMP disabled */
+
+#define OCFS2_SLOT_INFO_OLD_VALID   1 /* use for old slot info */
+
+/*
+ * Check interval multiplier
+ * The MMP block is written every update interval and initially checked every
+ * update interval x the multiplier (the value is then adapted based on the
+ * write latency). The reason is that writes can be delayed under load and we
+ * don't want readers to incorrectly assume that the filesystem is no longer
+ * in use.
+ */
+#define OCFS2_MMP_CHECK_MULT		2UL
+
+/*
+ * Minimum interval for MMP checking in seconds.
+ */
+#define OCFS2_MMP_MIN_CHECK_INTERVAL	5UL
+
+/*
+ * Maximum interval for MMP checking in seconds.
+ */
+#define OCFS2_MMP_MAX_CHECK_INTERVAL	300UL
 
 struct ocfs2_slot {
-	int sl_valid;
+	union {
+		unsigned int sl_valid;
+		unsigned int mmp_seq;
+	};
 	unsigned int sl_node_num;
 };
 
@@ -52,11 +93,11 @@ static void ocfs2_invalidate_slot(struct ocfs2_slot_info *si,
 }
 
 static void ocfs2_set_slot(struct ocfs2_slot_info *si,
-			   int slot_num, unsigned int node_num)
+			   int slot_num, unsigned int node_num, unsigned int valid)
 {
 	BUG_ON((slot_num < 0) || (slot_num >= si->si_num_slots));
 
-	si->si_slots[slot_num].sl_valid = 1;
+	si->si_slots[slot_num].sl_valid = valid;
 	si->si_slots[slot_num].sl_node_num = node_num;
 }
 
@@ -75,7 +116,8 @@ static void ocfs2_update_slot_info_extended(struct ocfs2_slot_info *si)
 		     i++, slotno++) {
 			if (se->se_slots[i].es_valid)
 				ocfs2_set_slot(si, slotno,
-					       le32_to_cpu(se->se_slots[i].es_node_num));
+					       le32_to_cpu(se->se_slots[i].es_node_num),
+					       le32_to_cpu(se->se_slots[i].es_valid));
 			else
 				ocfs2_invalidate_slot(si, slotno);
 		}
@@ -97,7 +139,8 @@ static void ocfs2_update_slot_info_old(struct ocfs2_slot_info *si)
 		if (le16_to_cpu(sm->sm_slots[i]) == (u16)OCFS2_INVALID_SLOT)
 			ocfs2_invalidate_slot(si, i);
 		else
-			ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]));
+			ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]),
+						OCFS2_SLOT_INFO_OLD_VALID);
 	}
 }
 
@@ -252,16 +295,14 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	int i, ret = -ENOSPC;
 
 	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
-		if (!si->si_slots[preferred].sl_valid ||
-		    !si->si_slots[preferred].sl_node_num) {
+		if (!si->si_slots[preferred].sl_valid) {
 			ret = preferred;
 			goto out;
 		}
 	}
 
 	for(i = 0; i < si->si_num_slots; i++) {
-		if (!si->si_slots[i].sl_valid ||
-		    !si->si_slots[i].sl_node_num) {
+		if (!si->si_slots[i].sl_valid) {
 			ret = i;
 			break;
 		}
@@ -270,6 +311,43 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	return ret;
 }
 
+/* Return first used slot.
+ * -ENOENT means all slots are clean, ->sl_valid should be
+ * OCFS2_MMP_SEQ_CLEAN or ZERO */
+static int __ocfs2_find_used_slot(struct ocfs2_slot_info *si)
+{
+	int i, ret = -ENOENT, valid;
+
+	for (i = 0; i < si->si_num_slots; i++) {
+		valid = si->si_slots[i].sl_valid;
+		if (valid == 0 || valid == OCFS2_MMP_SEQ_CLEAN)
+			continue;
+		if (valid <= OCFS2_MMP_SEQ_MAX ||
+			valid == OCFS2_MMP_SEQ_FSCK ||
+			valid == OCFS2_VALID_CLUSTER ||
+			valid == OCFS2_VALID_NOCLUSTER) {
+			ret = i;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int __ocfs2_find_expected_slot(struct ocfs2_slot_info *si,
+								unsigned int expected)
+{
+	int i;
+
+	for (i = 0; i < si->si_num_slots; i++) {
+		if (si->si_slots[i].sl_valid == expected) {
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 int ocfs2_node_num_to_slot(struct ocfs2_super *osb, unsigned int node_num)
 {
 	int slot;
@@ -445,21 +523,357 @@ void ocfs2_free_slot_info(struct ocfs2_super *osb)
 	__ocfs2_free_slot_info(si);
 }
 
+/*
+ * Get a random new sequence number but make sure it is not greater than
+ * OCFS2_MMP_SEQ_MAX.
+ */
+static unsigned int mmp_new_seq(void)
+{
+	u32 new_seq;
+
+	do {
+		new_seq = prandom_u32();
+	} while (new_seq > OCFS2_MMP_SEQ_MAX);
+
+	if (new_seq == 0)
+		return 1;
+	else
+		return new_seq;
+}
+
+/*
+ * kmmpd will update the MMP sequence every mmp_update_interval seconds
+ */
+static int kmmpd(void *data)
+{
+	struct ocfs2_super *osb = data;
+	struct super_block *sb = osb->sb;
+	struct ocfs2_slot_info *si = osb->slot_info;
+	int slot = osb->slot_num;
+	u32 seq, mmp_seq;
+	unsigned long failed_writes = 0;
+	u16 mmp_update_interval = osb->mmp_update_interval;
+	unsigned int mmp_check_interval;
+	unsigned long last_update_time;
+	unsigned long diff;
+	int retval = 0;
+
+	if (!ocfs2_mount_local(osb)) {
+		mlog(ML_ERROR, "kmmpd thread only works for local mount mode.\n");
+		goto wait_to_exit;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	seq = si->si_slots[slot].mmp_seq;
+
+	/*
+	 * Start with the higher mmp_check_interval and reduce it if
+	 * the MMP block is being updated on time.
+	 */
+	mmp_check_interval = max(OCFS2_MMP_CHECK_MULT * mmp_update_interval,
+				 OCFS2_MMP_MIN_CHECK_INTERVAL);
+
+	while (!kthread_should_stop() && !sb_rdonly(sb)) {
+		if (!OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+			mlog(ML_WARNING, "kmmpd being stopped since MMP feature"
+				     " has been disabled.");
+			goto wait_to_exit;
+		}
+		if (++seq > OCFS2_MMP_SEQ_MAX)
+			seq = 1;
+
+		spin_lock(&osb->osb_lock);
+		si->si_slots[slot].mmp_seq = mmp_seq = seq;
+		spin_unlock(&osb->osb_lock);
+
+		last_update_time = jiffies;
+		retval = ocfs2_update_disk_slot(osb, si, slot);
+
+		/*
+		 * Don't spew too many error messages. Print one every
+		 * (mmp_update_interval * 60) seconds.
+		 */
+		if (retval) {
+			if ((failed_writes % 60) == 0) {
+				ocfs2_error(sb, "Error writing to MMP block");
+			}
+			failed_writes++;
+		}
+
+		diff = jiffies - last_update_time;
+		if (diff < mmp_update_interval * HZ)
+			schedule_timeout_interruptible(mmp_update_interval *
+						       HZ - diff);
+
+		/*
+		 * We need to make sure that more than mmp_check_interval
+		 * seconds have not passed since writing. If that has happened
+		 * we need to check if the MMP block is as we left it.
+		 */
+		diff = jiffies - last_update_time;
+		if (diff > mmp_check_interval * HZ) {
+			retval = ocfs2_refresh_slot_info(osb);
+			if (retval) {
+				ocfs2_error(sb, "error reading MMP data: %d", retval);
+				goto wait_to_exit;
+			}
+
+			if (si->si_slots[slot].mmp_seq != mmp_seq) {
+				ocfs2_error(sb, "Error while updating MMP info. "
+					     "The filesystem seems to have been"
+					     " multiply mounted.");
+				retval = -EBUSY;
+				goto wait_to_exit;
+			}
+		}
+
+		/*
+		 * Adjust the mmp_check_interval depending on how much time
+		 * it took for the MMP block to be written.
+		 */
+		mmp_check_interval = max(min(OCFS2_MMP_CHECK_MULT * diff / HZ,
+					     OCFS2_MMP_MAX_CHECK_INTERVAL),
+					     OCFS2_MMP_MIN_CHECK_INTERVAL);
+	}
+
+	/*
+	 * Unmount seems to be clean.
+	 */
+	spin_lock(&osb->osb_lock);
+	si->si_slots[slot].mmp_seq = OCFS2_MMP_SEQ_CLEAN;
+	spin_unlock(&osb->osb_lock);
+
+	retval = ocfs2_update_disk_slot(osb, si, 0);
+
+wait_to_exit:
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!kthread_should_stop())
+			schedule();
+	}
+	set_current_state(TASK_RUNNING);
+	return retval;
+}
+
+void ocfs2_stop_mmpd(struct ocfs2_super *osb)
+{
+	if (osb->mmp_task) {
+		kthread_stop(osb->mmp_task);
+		osb->mmp_task = NULL;
+	}
+}
+
+/*
+ * Protect the filesystem from being mounted more than once.
+ *
+ * This function was inspired by the ext4 MMP feature. Because the HA
+ * stack helps ocfs2 manage node join/leave, we only focus on MMP
+ * under nocluster mode.
+ * Note that ocfs2 only uses slot 0 in nocluster mode.
+ *
+ * es_valid:
+ *  0: not available
+ *  1: valid, cluster mode
+ *  2: valid, nocluster mode
+ *
+ * parameters:
+ *  osb: the struct ocfs2_super
+ *  noclustered: under noclustered mount
+ *  slot: prefer slot number
+ */
+int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered)
+{
+	struct buffer_head *bh = NULL;
+	u32 seq;
+	struct ocfs2_slot_info *si = osb->slot_info;
+	unsigned int mmp_check_interval = osb->mmp_update_interval;
+	unsigned int wait_time = 0;
+	int retval = 0;
+	int slot = osb->slot_num;
+
+	if (!ocfs2_uses_extended_slot_map(osb)) {
+		mlog(ML_WARNING, "MMP only works on extended slot map.\n");
+		retval = -EINVAL;
+		goto bail;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	if (retval)
+		goto bail;
+
+	if (mmp_check_interval < OCFS2_MMP_MIN_CHECK_INTERVAL)
+		mmp_check_interval = OCFS2_MMP_MIN_CHECK_INTERVAL;
+
+	spin_lock(&osb->osb_lock);
+	seq = si->si_slots[slot].mmp_seq;
+
+	if (__ocfs2_find_used_slot(si) == -ENOENT)
+		goto skip;
+
+	/* TODO ocfs2-tools need to support this flag */
+	if (__ocfs2_find_expected_slot(si, OCFS2_MMP_SEQ_FSCK)) {
+		mlog(ML_NOTICE, "fsck is running on the filesystem");
+		spin_unlock(&osb->osb_lock);
+		retval = -EBUSY;
+		goto bail;
+	}
+	spin_unlock(&osb->osb_lock);
+
+	wait_time = min(mmp_check_interval * 2 + 1, mmp_check_interval + 60);
+
+	/* Print MMP interval if more than 20 secs. */
+	if (wait_time > OCFS2_MMP_MIN_CHECK_INTERVAL * 4)
+		mlog(ML_WARNING, "MMP interval %u higher than expected, please"
+			     " wait.\n", wait_time * 2);
+
+	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+		mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	if (retval)
+		goto bail;
+	if (seq != si->si_slots[slot].mmp_seq) {
+		mlog(ML_ERROR, "Device is already active on another node.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	spin_lock(&osb->osb_lock);
+skip:
+	/*
+	 * write a new random sequence number.
+	 */
+	seq = mmp_new_seq();
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq);
+	ocfs2_set_slot(si, slot, osb->node_num, seq);
+	spin_unlock(&osb->osb_lock);
+
+	ocfs2_update_disk_slot_extended(si, slot, &bh);
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq);
+	retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode));
+	if (retval < 0) {
+		mlog_errno(retval);
+		goto bail;
+	}
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x wait_time: %u\n", seq, si->si_slots[slot].mmp_seq, wait_time);
+
+	/*
+	 * wait for MMP interval and check mmp_seq.
+	 */
+	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+		mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	if (retval)
+		goto bail;
+
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq);
+	if (seq != si->si_slots[slot].mmp_seq) {
+		mlog(ML_ERROR, "Update seq failed, device is already active on another node.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	/*
+	 * There are two reasons we don't create kmmpd on a clustered mount:
+	 * - ocfs2 needs to grab osb->osb_lock to modify/access osb->si.
+	 * - In a cluster with a huge number of nodes, all nodes updating
+	 *   the same sector of '//slot_map' would cause an IO performance
+	 *   issue.
+	 *
+	 * That raises another question:
+	 * On a clustered mount the MMP seq is never updated, so how does
+	 * MMP handle a noclustered mount when a clustered mount already
+	 * exists?
+	 * The answer is the rule mentioned in ocfs2_find_slot().
+	 */
+	if (!noclustered) {
+		spin_lock(&osb->osb_lock);
+		ocfs2_set_slot(si, slot, osb->node_num, OCFS2_VALID_CLUSTER);
+		spin_unlock(&osb->osb_lock);
+
+		ocfs2_update_disk_slot_extended(si, slot, &bh);
+		retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode));
+		goto bail;
+	}
+
+	/*
+	 * Start a kernel thread to update the MMP block periodically.
+	 */
+	osb->mmp_task = kthread_run(kmmpd, osb, "kmmpd-%s", osb->sb->s_id);
+	if (IS_ERR(osb->mmp_task)) {
+		osb->mmp_task = NULL;
+		mlog(ML_WARNING, "Unable to create kmmpd thread for %s.",
+			     osb->sb->s_id);
+		retval = -EPERM;
+		goto bail;
+	}
+
+bail:
+	return retval;
+}
+
+static void show_conflict_mnt_msg(int clustered)
+{
+	const char *exist = clustered ? "non-clustered" : "clustered";
+
+	mlog(ML_ERROR, "Found %s mount info!", exist);
+	mlog(ML_ERROR, "Please clean %s slotmap info for mounting.\n", exist);
+	mlog(ML_ERROR, "eg. remount then unmount with %s mode\n", exist);
+}
+
+/*
+ * Even in readonly mode, we write the slot info to disk.
+ * This is intentional: if we did not update the slot info on a
+ * readonly mount, a later mount from another node in a cluster
+ * env could reuse the same slot, and a deadlock would happen!
+ */
 int ocfs2_find_slot(struct ocfs2_super *osb)
 {
-	int status;
+	int status = -EPERM;
 	int slot;
+	int noclustered = 0;
 	struct ocfs2_slot_info *si;
 
 	si = osb->slot_info;
 
 	spin_lock(&osb->osb_lock);
 	ocfs2_update_slot_info(si);
+	slot = __ocfs2_find_used_slot(si);
+	if (slot == 0 &&
+		((si->si_slots[0].sl_valid == OCFS2_VALID_NOCLUSTER) ||
+		 (si->si_slots[0].sl_valid < OCFS2_MMP_SEQ_MAX)))
+		noclustered = 1;
 
-	if (ocfs2_mount_local(osb))
-		/* use slot 0 directly in local mode */
-		slot = 0;
-	else {
+	/*
+	 * We set a rule:
+	 * If the last mount didn't unmount cleanly (e.g. crash), the
+	 * next mount MUST be the same mount type.
+	 */
+	if (ocfs2_mount_local(osb)) {
+		/* empty slotmap, or device didn't unmount from last time */
+		if ((slot == -ENOENT) || noclustered) {
+			/* use slot 0 directly in local mode */
+			slot = 0;
+			noclustered = 1;
+		} else {
+			spin_unlock(&osb->osb_lock);
+			show_conflict_mnt_msg(0);
+			status = -EINVAL;
+			goto bail;
+		}
+	} else {
+		if (noclustered) {
+			spin_unlock(&osb->osb_lock);
+			show_conflict_mnt_msg(1);
+			status = -EINVAL;
+			goto bail;
+		}
 		/* search for ourselves first and take the slot if it already
 		 * exists. Perhaps we need to mark this in a variable for our
 		 * own journal recovery? Possibly not, though we certainly
@@ -481,7 +895,21 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
 			       slot, osb->dev_str);
 	}
 
-	ocfs2_set_slot(si, slot, osb->node_num);
+	if (OCFS2_HAS_INCOMPAT_FEATURE(osb->sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+		osb->slot_num = slot;
+		spin_unlock(&osb->osb_lock);
+		status = ocfs2_multi_mount_protect(osb, noclustered);
+		if (status < 0) {
+			mlog(ML_ERROR, "MMP failed to start.\n");
+			goto mmp_fail;
+		}
+
+		trace_ocfs2_find_slot(osb->slot_num);
+		return status;
+	}
+
+	ocfs2_set_slot(si, slot, osb->node_num, noclustered ?
+			OCFS2_VALID_NOCLUSTER : OCFS2_VALID_CLUSTER);
 	osb->slot_num = slot;
 	spin_unlock(&osb->osb_lock);
 
@@ -490,6 +918,7 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
 	status = ocfs2_update_disk_slot(osb, si, osb->slot_num);
 	if (status < 0) {
 		mlog_errno(status);
+mmp_fail:
 		/*
 		 * if write block failed, invalidate slot to avoid overwrite
 		 * slot during dismount in case another node rightly has mounted
diff --git a/fs/ocfs2/slot_map.h b/fs/ocfs2/slot_map.h
index a43644570b53..d4d147b0c190 100644
--- a/fs/ocfs2/slot_map.h
+++ b/fs/ocfs2/slot_map.h
@@ -25,4 +25,7 @@ int ocfs2_slot_to_node_num_locked(struct ocfs2_super *osb, int slot_num,
 
 int ocfs2_clear_slot(struct ocfs2_super *osb, int slot_num);
 
+int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered);
+void ocfs2_stop_mmpd(struct ocfs2_super *osb);
+
 #endif
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index f7298816d8d9..b0e76b06efc3 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -609,6 +609,7 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 	struct mount_options parsed_options;
 	struct ocfs2_super *osb = OCFS2_SB(sb);
 	u32 tmp;
+	int noclustered;
 
 	sync_filesystem(sb);
 
@@ -619,7 +620,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 	}
 
 	tmp = OCFS2_MOUNT_NOCLUSTER;
-	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
+	noclustered = osb->s_mount_opt & tmp;
+	if (noclustered != (parsed_options.mount_opt & tmp)) {
 		ret = -EINVAL;
 		mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
 		goto out;
@@ -686,10 +688,20 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 			}
 			sb->s_flags &= ~SB_RDONLY;
 			osb->osb_flags &= ~OCFS2_OSB_SOFT_RO;
+			if (OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+				spin_unlock(&osb->osb_lock);
+				if (ocfs2_multi_mount_protect(osb, noclustered)) {
+					mlog(ML_ERROR, "starting MMP failed.\n");
+					ocfs2_stop_mmpd(osb);
+					ret = -EROFS;
+					goto unlocked_osb;
+				}
+			}
 		}
 		trace_ocfs2_remount(sb->s_flags, osb->osb_flags, *flags);
 unlock_osb:
 		spin_unlock(&osb->osb_lock);
+unlocked_osb:
 		/* Enable quota accounting after remounting RW */
 		if (!ret && !(*flags & SB_RDONLY)) {
 			if (sb_any_quota_suspended(sb))
@@ -722,6 +734,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 		sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
 			((osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) ?
 							SB_POSIXACL : 0);
+		if (sb_rdonly(osb->sb))
+			ocfs2_stop_mmpd(osb);
 	}
 out:
 	return ret;
@@ -1833,7 +1847,7 @@ static int ocfs2_mount_volume(struct super_block *sb)
 	status = ocfs2_init_local_system_inodes(osb);
 	if (status < 0) {
 		mlog_errno(status);
-		goto out_super_lock;
+		goto out_find_slot;
 	}
 
 	status = ocfs2_check_volume(osb);
@@ -1858,6 +1872,8 @@ static int ocfs2_mount_volume(struct super_block *sb)
 	/* before journal shutdown, we should release slot_info */
 	ocfs2_free_slot_info(osb);
 	ocfs2_journal_shutdown(osb);
+out_find_slot:
+	ocfs2_stop_mmpd(osb);
 out_super_lock:
 	ocfs2_super_unlock(osb, 1);
 out_dlm:
@@ -1878,6 +1894,8 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
 	osb = OCFS2_SB(sb);
 	BUG_ON(!osb);
 
+	ocfs2_stop_mmpd(osb);
+
 	/* Remove file check sysfs related directores/files,
 	 * and wait for the pending file check operations */
 	ocfs2_filecheck_remove_sysfs(osb);
@@ -2086,6 +2104,7 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	snprintf(osb->dev_str, sizeof(osb->dev_str), "%u,%u",
 		 MAJOR(osb->sb->s_dev), MINOR(osb->sb->s_dev));
 
+	osb->mmp_update_interval = le16_to_cpu(di->id2.i_super.s_mmp_update_interval);
 	osb->max_slots = le16_to_cpu(di->id2.i_super.s_max_slots);
 	if (osb->max_slots > OCFS2_MAX_SLOTS || osb->max_slots == 0) {
 		mlog(ML_ERROR, "Invalid number of node slots (%u)\n",
-- 
2.37.1


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature Heming Zhao via Ocfs2-devel
@ 2022-07-31  9:13   ` heming.zhao--- via Ocfs2-devel
  2022-08-08  8:19   ` Joseph Qi via Ocfs2-devel
  1 sibling, 0 replies; 23+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-07-31  9:13 UTC (permalink / raw)
  To: ocfs2-devel, joseph.qi

On 7/30/22 09:14, Heming Zhao wrote:
> MMP (multiple mount protection) gives the filesystem the ability to
> prevent itself from being mounted multiple times.
> 
> To avoid data corruption when non-clustered and/or clustered mounts
> happen at the same time, this commit introduces the MMP feature. The
> MMP idea comes from the ext4 MMP (fs/ext4/mmp.c) code. Because ocfs2
> is a clustered fs, and for compatibility with the existing slotmap
> feature, I did some optimization and modification when porting from
> ext4 to ocfs2.
> 
> For optimization:
> mmp has a kthread kmmpd-<dev>, which is only created in non-clustered
> mode.
> 
> We set a rule:
> If the last mount didn't unmount cleanly (e.g. crash), the next mount
> MUST be the same mount type.
> 
> At last, this commit also fixes the issue mentioned in commit
> c80af0c250c8 ("Revert "ocfs2: mount shared volume without ha stack").
> 
> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> ---
>   fs/ocfs2/ocfs2.h    |   2 +
>   fs/ocfs2/ocfs2_fs.h |  13 +-
>   fs/ocfs2/slot_map.c | 459 ++++++++++++++++++++++++++++++++++++++++++--
>   fs/ocfs2/slot_map.h |   3 +
>   fs/ocfs2/super.c    |  23 ++-
>   5 files changed, 479 insertions(+), 21 deletions(-)
> ... ...
>   /*C0*/  __le64 s_reserved2[15];		/* Fill out superblock */
> diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
> index 0b0ae3ebb0cf..86a21140ead6 100644
> --- a/fs/ocfs2/slot_map.c
> +++ b/fs/ocfs2/slot_map.c
> ... ...
> +void ocfs2_stop_mmpd(struct ocfs2_super *osb)
> +{
> +	if (osb->mmp_task) {
> +		kthread_stop(osb->mmp_task);
> +		osb->mmp_task = NULL;
> +	}
> +}
> +
> +/*
> + * Protect the filesystem from being mounted more than once.
> + *
> + * This function was inspired by ext4 MMP feature. Because HA stack
> + * helps ocfs2 to manage nodes join/leave, so we only focus on MMP
> + * under nocluster mode.
> + * Another info is ocfs2 only uses slot 0 on nocuster mode.
> + *
> + * es_valid:
> + *  0: not available
> + *  1: valid, cluster mode
> + *  2: valid, nocluster mode
> + *
> + * parameters:
> + *  osb: the struct ocfs2_super
> + *  noclustered: under noclustered mount
> + *  slot: prefer slot number

Sorry, the above es_valid & parameters comments are outdated. I will remove them in the next version.

> + */
> +int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered)


I forgot to run checkpatch.pl for all 4 patches, and it reported many format
warnings. Please treat these patches as a draft version; let's first discuss whether
the mmp+noncluster idea is workable/acceptable. If no one rejects this code, I will file v1.

(For test cases, I put them in cover letter.)

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack" Heming Zhao via Ocfs2-devel
@ 2022-07-31 17:42   ` Mark Fasheh via Ocfs2-devel
  2022-08-01  1:01     ` heming.zhao--- via Ocfs2-devel
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Fasheh via Ocfs2-devel @ 2022-07-31 17:42 UTC (permalink / raw)
  To: Heming Zhao; +Cc: ocfs2-devel

Hi Heming,

On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
<ocfs2-devel@oss.oracle.com> wrote:
>
> the key different between local mount and non-clustered mount:
> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
> convert job without ha stack. non-clustered mount feature can run
> totally without ha stack.

Can you please elaborate on this? Local mounts can run without a
cluster stack so I don't see the difference there. We have
tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
users' data - that's a feature, not a bug. So what I'm seeing here is
just opening us to potential corruptions. Is there a specific use case
here that you're trying to account for? Are you fixing a particular
bug?

Thanks,
  --Mark


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-07-31 17:42   ` Mark Fasheh via Ocfs2-devel
@ 2022-08-01  1:01     ` heming.zhao--- via Ocfs2-devel
  2022-08-01  2:25       ` heming.zhao--- via Ocfs2-devel
  2022-08-04 23:53       ` Mark Fasheh via Ocfs2-devel
  0 siblings, 2 replies; 23+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-08-01  1:01 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel

Hello Mark,

On 8/1/22 01:42, Mark Fasheh wrote:
> Hi Heming,
> 
> On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
> <ocfs2-devel@oss.oracle.com> wrote:
>>
>> the key different between local mount and non-clustered mount:
>> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
>> convert job without ha stack. non-clustered mount feature can run
>> totally without ha stack.
> 
> Can you please elaborate on this? Local mounts can run without a
> cluster stack so I don't see the difference there. We have

I am using the pacemaker cluster stack. In my env, the trouble of converting between
local and clustered mounts only happens under a cluster stack.

The non-clustered mount feature (Gang He's commit 912f655d78c5) gave ocfs2 the
ability to mount a volume in any env (with/without a cluster stack).
Commit 912f655d78c5 derived from a SUSE customer complaint: the user wanted to fsck
a backup ocfs2 volume in a non-clustered env. They wanted to access the volume quickly
and didn't want to take the time/resources to set up an HA stack. (By the way, a pcmk
stack needs at least two nodes to set up a cluster.)

> tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
> users data - that's a feature, not a bug. So what I'm seeing here is
> just opening us to potential corruptions. Is there a specific use case
> here that you're trying to account for? Are you fixing a particular
> bug?
> 

Tunefs.ocfs2 still needs the HA/dlm stack to protect the joining action. Commit
912f655d78c5 works in a non-clustered env, which needs another technique (e.g. MMP)
to protect against corruption.

From my viewpoint, the non-clustered mount code is based on the local mount code,
which gives more flexibility than a local mount. A non-clustered mount uses a unified
mount style aligned with clustered mount. I think users will prefer a non-clustered
mount over using tunefs.ocfs2 to change the mount type.

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-08-01  1:01     ` heming.zhao--- via Ocfs2-devel
@ 2022-08-01  2:25       ` heming.zhao--- via Ocfs2-devel
  2022-08-04 23:53       ` Mark Fasheh via Ocfs2-devel
  1 sibling, 0 replies; 23+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-08-01  2:25 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel

I found I didn't include Joseph in previous mail.

On 8/1/22 09:01, heming.zhao--- via Ocfs2-devel wrote:
> Hello Mark,
> 
> On 8/1/22 01:42, Mark Fasheh wrote:
>> Hi Heming,
>>
>> On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
>> <ocfs2-devel@oss.oracle.com> wrote:
>>>
>>> the key different between local mount and non-clustered mount:
>>> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
>>> convert job without ha stack. non-clustered mount feature can run
>>> totally without ha stack.
>>
>> Can you please elaborate on this? Local mounts can run without a
>> cluster stack so I don't see the difference there. We have
> 
> I am using pacemaker cluster stack. In my env, the trouble of the converting between
> local and clustered mounts are only happening on cluster stack.
> 
> the non-clustered mount feature (Gang He commit: 912f655d78c5) gave ocfs2 the ability
> to mount volume at any env (with/without cluster stack).
> The 912f655d78c5 derived from SUSE customer complain: User wanted to fsck the backup
> ocfs2 volume in non-clustered env. They wanted to access the volume quickly and didn't
> want to take time/resource to set up HA stack. (by the way, pcmk stack at least needs
> two nodes to set up a cluster.)

Ooh, the above "needs two nodes to set up a cluster" is wrong. A user could use a
special corosync.conf to set up a single-node cluster. But anyhow, the key question
is why we can't bypass the HA stack setup step.

> 
>> tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
>> users data - that's a feature, not a bug. So what I'm seeing here is
>> just opening us to potential corruptions. Is there a specific use case
>> here that you're trying to account for? Are you fixing a particular
>> bug?
>>
> 
> Tunefs.ocfs2 still needs HA/dlm stack to protect joining action. commit 912f655d78c5
> works on non-clustered env, which needs other tech (eg. MMP) to protect corrupting.
> 
>  From my viewpoint, the non-clustered mount code is based on local mount code,
> which gives more flexible than local mount. non-clustered mount uses unify mount
> style align with clustered mount. I think users will like more to use non-clustered
> mount than using tunefs.ocfs2 to change mount type.
> 
> Thanks,
> Heming



* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-08-01  1:01     ` heming.zhao--- via Ocfs2-devel
  2022-08-01  2:25       ` heming.zhao--- via Ocfs2-devel
@ 2022-08-04 23:53       ` Mark Fasheh via Ocfs2-devel
  2022-08-05  4:11         ` Mark Fasheh via Ocfs2-devel
                           ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Mark Fasheh via Ocfs2-devel @ 2022-08-04 23:53 UTC (permalink / raw)
  To: heming.zhao; +Cc: ocfs2-devel

Hi Heming,

On Sun, Jul 31, 2022 at 6:02 PM heming.zhao@suse.com
<heming.zhao@suse.com> wrote:
>
> Hello Mark,
>
> On 8/1/22 01:42, Mark Fasheh wrote:
> > Hi Heming,
> >
> > On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
> > <ocfs2-devel@oss.oracle.com> wrote:
> >>
> >> the key different between local mount and non-clustered mount:
> >> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
> >> convert job without ha stack. non-clustered mount feature can run
> >> totally without ha stack.
> >
> > Can you please elaborate on this? Local mounts can run without a
> > cluster stack so I don't see the difference there. We have
>
> I am using pacemaker cluster stack. In my env, the trouble of the converting between
> local and clustered mounts are only happening on cluster stack.
>
> the non-clustered mount feature (Gang He commit: 912f655d78c5) gave ocfs2 the ability
> to mount volume at any env (with/without cluster stack).
> The 912f655d78c5 derived from SUSE customer complain: User wanted to fsck the backup
> ocfs2 volume in non-clustered env. They wanted to access the volume quickly and didn't
> want to take time/resource to set up HA stack. (by the way, pcmk stack at least needs
> two nodes to set up a cluster.)

Ok. I had some time to look over the ext4 mmp patches. I feel like
there are two questions here:

1) Is MMP a useful feature for Ocfs2

My answer is yes absolutely, the user should have the option to enable
this on local mounts.


2) Should we allow the user to bypass our cluster checks?

On this question I'm still a 'no'. I simply haven't seen enough
evidence to warrant such a drastic change in policy. Allowing it via
mount option too just feels extremely error-prone. I think we need to
explore alternative avenues to helping the user out here. As you noted
in your followup, a single node config is entirely possible in
pacemaker (I've run that config myself). Why not provide an easy way
for the user to drop down to that sort of a config? I know that's kind
of pushing responsibility for this to the cluster stack, but that's
where it belongs in the first place.

Another option might be an 'observer mode' mount, where the node
participates in the cluster (and the file system locking) but purely
in a read-only fashion.


> > tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
> > users data - that's a feature, not a bug. So what I'm seeing here is
> > just opening us to potential corruptions. Is there a specific use case
> > here that you're trying to account for? Are you fixing a particular
> > bug?
> >
>
> Tunefs.ocfs2 still needs HA/dlm stack to protect joining action. commit 912f655d78c5
> works on non-clustered env, which needs other tech (eg. MMP) to protect corrupting.

FWIW, I'm glad that we pulled commit 912f655d78c5 and I do not think
we should go back to that paradigm.


>  From my viewpoint, the non-clustered mount code is based on local mount code,
> which gives more flexible than local mount. non-clustered mount uses unify mount
> style align with clustered mount. I think users will like more to use non-clustered
> mount than using tunefs.ocfs2 to change mount type.

Can we get rid of the mount option, and make mmp something that users
can turn on for Ocfs2 local mounts? I don't recall if we make a slot
map for local mounts or not but it wouldn't be difficult to add that.

Btw, thank you very much for all of these patches, and also thank you
for the very detailed test cases in your initial email. I'll try to
give the actual code a review as well.

Thanks,
  --Mark


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-08-04 23:53       ` Mark Fasheh via Ocfs2-devel
@ 2022-08-05  4:11         ` Mark Fasheh via Ocfs2-devel
  2022-08-06 15:53           ` heming.zhao--- via Ocfs2-devel
  2022-08-06 16:20           ` Heming Zhao via Ocfs2-devel
  2022-08-06 15:44         ` heming.zhao--- via Ocfs2-devel
  2022-08-06 16:15         ` Heming Zhao via Ocfs2-devel
  2 siblings, 2 replies; 23+ messages in thread
From: Mark Fasheh via Ocfs2-devel @ 2022-08-05  4:11 UTC (permalink / raw)
  To: heming.zhao; +Cc: ocfs2-devel

On Thu, Aug 4, 2022 at 4:53 PM Mark Fasheh <mark@fasheh.com> wrote:
> 2) Should we allow the user to bypass our cluster checks?
>
> On this question I'm still a 'no'. I simply haven't seen enough
> evidence to warrant such a drastic change in policy. Allowing it via
> mount option too just feels extremely error-prone. I think we need to
> explore alternative avenues to helping the user out here. As you
> noted in your followup, a single node config is entirely possible in
> pacemaker (I've run that config myself). Why not provide an easy way
> for the user to drop down to that sort of a config? I know that's
> kind of pushing responsibility for this to the cluster stack, but that's
> where it belongs in the first place.
>
> Another option might be an 'observer mode' mount, where the node
> participates in the cluster (and the file system locking) but purely
> in a read-only fashion.

Thinking about this some more... The only way that this works without
potential corruptions is if we always write a periodic mmp sequence,
even in clustered mode (which might mean each node writes to its own
sector). That way tunefs can always check the disk for a mounted node,
even without a cluster stack up. If tunefs sees anyone writing
sequences to the disk, it can safely fail the operation. Tunefs also
would have to be writing an mmp sequence once it has determined that
the disk is not mounted. It could also write some flag alongside the
sequence that says 'tunefs is working on this disk'. If a cluster
mount comes up and sees a live sequence with that flag, it will know
to fail the mount request as the disk is being modified. Local mounts
can also use this to ensure that they are the only mounted node.

As it turns out, we already do pretty much all of the sequence writing
already for the o2cb cluster stack - check out cluster/heartbeat.c.
If memory serves, tunefs.ocfs2 has code to write to this heartbeat
area as well. For o2cb, we use the disk heartbeat to detect node
liveness, and to kill our local node if we see disk timeouts. For
pcmk, we shouldn't take any of these actions as it is none of our
responsibility. Under pcmk, the heartbeating would be purely for mount
protection checks.

The downside to this is that all nodes would be heartbeating to the
disk on a regular interval, not just one. To be fair, this is exactly
how o2cb works and with the correct timeout choices, we were able to
avoid a measurable performance impact, though in any case this might
have to be a small price the user pays for cluster aware mount
protection.

Let me know what you think.

Thanks,
  --Mark


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-08-04 23:53       ` Mark Fasheh via Ocfs2-devel
  2022-08-05  4:11         ` Mark Fasheh via Ocfs2-devel
@ 2022-08-06 15:44         ` heming.zhao--- via Ocfs2-devel
  2022-08-06 16:15         ` Heming Zhao via Ocfs2-devel
  2 siblings, 0 replies; 23+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-08-06 15:44 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel

Hi Mark,

On 8/5/22 07:53, Mark Fasheh wrote:
> Hi Heming,
> 
> On Sun, Jul 31, 2022 at 6:02 PM heming.zhao@suse.com
> <heming.zhao@suse.com> wrote:
>>
>> Hello Mark,
>>
>> On 8/1/22 01:42, Mark Fasheh wrote:
>>> Hi Heming,
>>>
>>> On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
>>> <ocfs2-devel@oss.oracle.com> wrote:
>>>>
>>>> the key different between local mount and non-clustered mount:
>>>> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
>>>> convert job without ha stack. non-clustered mount feature can run
>>>> totally without ha stack.
>>>
>>> Can you please elaborate on this? Local mounts can run without a
>>> cluster stack so I don't see the difference there. We have
>>
>> I am using pacemaker cluster stack. In my env, the trouble of the converting between
>> local and clustered mounts are only happening on cluster stack.
>>
>> the non-clustered mount feature (Gang He commit: 912f655d78c5) gave ocfs2 the ability
>> to mount volume at any env (with/without cluster stack).
>> The 912f655d78c5 derived from SUSE customer complain: User wanted to fsck the backup
>> ocfs2 volume in non-clustered env. They wanted to access the volume quickly and didn't
>> want to take time/resource to set up HA stack. (by the way, pcmk stack at least needs
>> two nodes to set up a cluster.)
> 
> Ok. I had some time to look over the ext4 mmp patches. I feel like
> there are two questions here:
> 
> 1) Is MMP a useful feature for Ocfs2
> 
> My answer is yes absolutely, the user should have the option to enable
> this on local mounts.
> 

me too.
> 
> 2) Should we allow the user to bypass our cluster checks?
> 
> On this question I'm still a 'no'. I simply haven't seen enough
> evidence to warrant such a drastic change in policy. Allowing it via
> mount option too just feels extremely error-prone. I think we need to
> explore alternative avenues to helping the user out here. As you noted
> in your followup, a single node config is entirely possible in pacemaker
> (I've run that config myself). Why not provide an easy way for the user
> to drop down to that sort of a config? I know that's kind of pushing
> responsibility for this to the cluster stack, but that's
> where it belongs in the first place.
> 

The reason for creating commit 912f655d78c5d4a is that the user didn't want
to do any setup job for the cluster stack. So any HA solution (eg:
automatically config a single node, auto install required software, create a
VM with an already set up HA stack, etc.) does not remove the pain point.
Once cluster setup is bypassed, only non-clustered mount is left. And from
the end user's viewpoint, non-clustered mount is also the easiest way to
mount an ocfs2 volume.

> Another option might be an 'observer mode' mount, where the node
> participates in the cluster (and the file system locking) but purely
> in a read-only fashion.
> 

In my view, 'observer mode' is just a read-only mount; we don't need to
create it.
For 912f655d78c5d4a, the user takes a snapshot of the ocfs2 volume and does
a health check in a different env (not the production env); the two may
belong to two different networks.

> 
>>> tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
>>> users data - that's a feature, not a bug. So what I'm seeing here is
>>> just opening us to potential corruptions. Is there a specific use case
>>> here that you're trying to account for? Are you fixing a particular
>>> bug?
>>>
>>
>> Tunefs.ocfs2 still needs HA/dlm stack to protect joining action. commit 912f655d78c5
>> works on non-clustered env, which needs other tech (eg. MMP) to protect corrupting.
> 
> FWIW, I'm glad that we pulled commit 912f655d78c5 and I do not think
> we should go back to that paradigm.
> 

If you or other maintainers prefer to pull out non-clustered mount, I respect
the decision.

> 
>>   From my viewpoint, the non-clustered mount code is based on local mount code,
>> which gives more flexible than local mount. non-clustered mount uses unify mount
>> style align with clustered mount. I think users will like more to use non-clustered
>> mount than using tunefs.ocfs2 to change mount type.
> 
> Can we get rid of the mount option, and make mmp something that users
> can turn on for Ocfs2 local mounts? I don't recall if we make a slot
> map for local mounts or not but it wouldn't be difficult to add that.
> 

From the tech side, it's possible to enable MMP in local mount.
Without 912f655d78c5d4a, local mount uses slot 0 (the first available slot).

> Btw, thank you very much for all of these patches, and also thank you
> for the very detailed test cases in your initial email. I'll try to
> give the actual code a review as well.
> 
> Thanks,
>    --Mark

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-08-05  4:11         ` Mark Fasheh via Ocfs2-devel
@ 2022-08-06 15:53           ` heming.zhao--- via Ocfs2-devel
  2022-08-06 16:20           ` Heming Zhao via Ocfs2-devel
  1 sibling, 0 replies; 23+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-08-06 15:53 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel

Hi Mark,

On 8/5/22 12:11, Mark Fasheh wrote:
> On Thu, Aug 4, 2022 at 4:53 PM Mark Fasheh <mark@fasheh.com> wrote:
>> 2) Should we allow the user to bypass our cluster checks?
>>
>> On this question I'm still a 'no'. I simply haven't seen enough
>> evidence to warrant such a drastic change in policy. Allowing it via
>> mount option too just feels extremely error-prone. I think we need to
> explore alternative avenues to helping the user out here. As you noted
> in your followup, a single node config is entirely possible in pacemaker
> (I've run that config myself). Why not provide an easy way for the user
> to drop down to that sort of a config? I know that's kind of pushing
> responsibility for this to the cluster stack, but that's
>> where it belongs in the first place.
>>
>> Another option might be an 'observer mode' mount, where the node
>> participates in the cluster (and the file system locking) but purely
>> in a read-only fashion.
> 
> Thinking about this some more... The only way that this works without
> potential corruptions is if we always write a periodic mmp sequence,
> even in clustered mode (which might mean each node writes to its own
> sector). That way tunefs can always check the disk for a mounted node,
> even without a cluster stack up. If tunefs sees anyone writing
> sequences to the disk, it can safely fail the operation. Tunefs also
> would have to be writing an mmp sequence once it has determined that
> the disk is not mounted. It could also write some flag alongside the
> sequence that says 'tunefs is working on this disk'. If a cluster
> mount comes up and sees a live sequence with that flag, it will know
> to fail the mount request as the disk is being modified. Local mounts
> can also use this to ensure that they are the only mounted node.
> 

The tunefs check & write sequence steps above are the MMP feature workflow.

The tunefs workflow you mentioned matches my patch [4/4] idea, and that
patch does all the work in the kernel's ocfs2_find_slot().

Because sequences should be saved in the SB, updating them requires grabbing
->osb_lock, which influences performance.

And to save space, we won't allocate a different disk block for each node.
If multiple nodes share the same disk block (eg, keep using slot_map for
saving seqs), the updating job will create an IO performance issue.

To fix the performance issue, in my view, we should disable MMP sequence
updating when the mounting mode is clustered. So I made a rule:

** If the last mount didn't do an unmount (eg: crash), the next mount MUST be the same mount type. **

Another way to state the rule:
a newly arriving node's mount type must be the same as the existing mount type.

And there is a basic fact:
the current ocfs2 code under a cluster stack already has the ability to
prevent multiple mounts when the mount type is clustered.

(from patch 4/4) there are mount labels:

+#define OCFS2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define OCFS2_MMP_SEQ_FSCK  0xE24D4D50U /* mmp_seq value when being fscked */
+#define OCFS2_MMP_SEQ_MAX   0xE24D4D4FU /* maximum valid mmp_seq value */
+#define OCFS2_MMP_SEQ_INIT  0x0         /* mmp_seq init value */
+#define OCFS2_VALID_CLUSTER   0xE24D4D55U /* value for clustered mount
+                                              under MMP disabled */
+#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount
+                                              under MMP disabled */

Whenever a mount succeeds, there should be three types of living labels
in the slotmap area:
- OCFS2_MMP_SEQ_CLEAN, OCFS2_VALID_NOCLUSTER - for local/non-clustered mount
- OCFS2_VALID_CLUSTER - for clustered mount

A newly arriving node will check whether any slot contains living labels.

Whenever an unmount succeeds, there should be two types of left-over labels
in the slotmap area:
- OCFS2_MMP_SEQ_CLEAN or 0 (zero)

When a node does an unmount, according to the mount type, it will clear
(zero) the slot or write OCFS2_MMP_SEQ_CLEAN into it.

> As it turns out, we already do pretty much all of the sequence writing
> already for the o2cb cluster stack - check out cluster/heartbeat.c.
> If memory serves, tunefs.ocfs2 has code to write to this heartbeat
> area as well. For o2cb, we use the disk heartbeat to detect node
> liveness, and to kill our local node if we see disk timeouts. For
> pcmk, we shouldn't take any of these actions as it is none of our
> responsibility. Under pcmk, the heartbeating would be purely for mount
> protection checks.
> 

In my thinking, under a cluster stack there is already enough protection
logic. For local/non-clustered mount, MMP gives the ability to detect node
liveness, so I only enable the kmmpd-<dev> kthread for local/non-clustered
mounts.

> The downside to this is that all nodes would be heartbeating to the
> disk on a regular interval, not just one. To be fair, this is exactly
> how o2cb works and with the correct timeout choices, we were able to
> avoid a measurable performance impact, though in any case this might
> have to be a small price the user pays for cluster aware mount
> protection.

I already considered this performance issue in patch [4/4].

> 
> Let me know what you think.
> 
> Thanks,
>    --Mark

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-08-04 23:53       ` Mark Fasheh via Ocfs2-devel
  2022-08-05  4:11         ` Mark Fasheh via Ocfs2-devel
  2022-08-06 15:44         ` heming.zhao--- via Ocfs2-devel
@ 2022-08-06 16:15         ` Heming Zhao via Ocfs2-devel
  2 siblings, 0 replies; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-08-06 16:15 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel

Hello Mark and Joseph,

(please ignore my previous reply mail)
I may have hit a thunderbird bug. I clearly remember adding you (Joseph) in
CC for this mail & the previous mail, but I only see Mark and ocfs2-devel in
the receiver list. The mail format is also messed up; thunderbird replaced
'\n' with '\r\n'.

To fix this problem, I switched to neomutt and resent these two mails.

On Thu, Aug 04, 2022 at 04:53:12PM -0700, Mark Fasheh wrote:
> Hi Heming,
> 
> On Sun, Jul 31, 2022 at 6:02 PM heming.zhao@suse.com
> <heming.zhao@suse.com> wrote:
> >
> > Hello Mark,
> >
> > On 8/1/22 01:42, Mark Fasheh wrote:
> > > Hi Heming,
> > >
> > > On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
> > > <ocfs2-devel@oss.oracle.com> wrote:
> > >>
> > >> the key different between local mount and non-clustered mount:
> > >> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
> > >> convert job without ha stack. non-clustered mount feature can run
> > >> totally without ha stack.
> > >
> > > Can you please elaborate on this? Local mounts can run without a
> > > cluster stack so I don't see the difference there. We have
> >
> > I am using pacemaker cluster stack. In my env, the trouble of the converting between
> > local and clustered mounts are only happening on cluster stack.
> >
> > the non-clustered mount feature (Gang He commit: 912f655d78c5) gave ocfs2 the ability
> > to mount volume at any env (with/without cluster stack).
> > The 912f655d78c5 derived from SUSE customer complain: User wanted to fsck the backup
> > ocfs2 volume in non-clustered env. They wanted to access the volume quickly and didn't
> > want to take time/resource to set up HA stack. (by the way, pcmk stack at least needs
> > two nodes to set up a cluster.)
> 
> Ok. I had some time to look over the ext4 mmp patches. I feel like
> there are two questions here:
> 
> 1) Is MMP a useful feature for Ocfs2
> 
> My answer is yes absolutely, the user should have the option to enable
> this on local mounts.
> 

me too.

> 
> 2) Should we allow the user to bypass our cluster checks?
> 
> On this question I'm still a 'no'. I simply haven't seen enough
> evidence to warrant such a drastic change in policy. Allowing it via
> mount option too just feels extremely error-prone. I think we need to
> explore alternative avenues to helping the user out here. As you noted
> in your followup, a single node config is entirely possible in pacemaker
> (I've run that config myself). Why not provide an easy way for the user
> to drop down to that sort of a config? I know that's kind of pushing
> responsibility for this to the cluster stack, but that's
> where it belongs in the first place.
> 

The reason for creating commit 912f655d78c5d4a is that the user didn't want
to do any setup job for the cluster stack. So any HA solution (eg:
automatically config a single node, auto install required software, create a
VM with an already set up HA stack, etc.) does not remove the pain point.
Once cluster setup is bypassed, only non-clustered mount is left. And from
the end user's viewpoint, non-clustered mount is also the easiest way to
mount an ocfs2 volume.

> Another option might be an 'observer mode' mount, where the node
> participates in the cluster (and the file system locking) but purely
> in a read-only fashion.
> 

In my view, 'observer mode' is just a read-only mount; we don't need to
create it.
For 912f655d78c5d4a, the user takes a snapshot of the ocfs2 volume and does
a health check in a different env (not the production env); the two may
belong to two different networks.

> 
> > > tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
> > > users data - that's a feature, not a bug. So what I'm seeing here is
> > > just opening us to potential corruptions. Is there a specific use case
> > > here that you're trying to account for? Are you fixing a particular
> > > bug?
> > >
> >
> > Tunefs.ocfs2 still needs HA/dlm stack to protect joining action. commit 912f655d78c5
> > works on non-clustered env, which needs other tech (eg. MMP) to protect corrupting.
> 
> FWIW, I'm glad that we pulled commit 912f655d78c5 and I do not think
> we should go back to that paradigm.

If you or other maintainers prefer to pull out non-clustered mount, I respect
the decision.

> 
> 
> >  From my viewpoint, the non-clustered mount code is based on local mount code,
> > which gives more flexible than local mount. non-clustered mount uses unify mount
> > style align with clustered mount. I think users will like more to use non-clustered
> > mount than using tunefs.ocfs2 to change mount type.
> 
> Can we get rid of the mount option, and make mmp something that users
> can turn on for Ocfs2 local mounts? I don't recall if we make a slot
> map for local mounts or not but it wouldn't be difficult to add that.
> 

From the tech side, it's possible to enable MMP in local mount.

Without 912f655d78c5d4a, local mount uses slot 0 (the first available slot).

> Btw, thank you very much for all of these patches, and also thank you
> for the very detailed test cases in your initial email. I'll try to
> give the actual code a review as well.
> 
> Thanks,
>   --Mark

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
  2022-08-05  4:11         ` Mark Fasheh via Ocfs2-devel
  2022-08-06 15:53           ` heming.zhao--- via Ocfs2-devel
@ 2022-08-06 16:20           ` Heming Zhao via Ocfs2-devel
  1 sibling, 0 replies; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-08-06 16:20 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: ocfs2-devel

Hello Mark and Joseph,

please ignore my previous badly formatted reply mail

On Thu, Aug 04, 2022 at 09:11:53PM -0700, Mark Fasheh wrote:
> On Thu, Aug 4, 2022 at 4:53 PM Mark Fasheh <mark@fasheh.com> wrote:
> > 2) Should we allow the user to bypass our cluster checks?
> >
> > On this question I'm still a 'no'. I simply haven't seen enough
> > evidence to warrant such a drastic change in policy. Allowing it via
> > mount option too just feels extremely error-prone. I think we need to
> > explore alternative avenues to helping the user out here. As you noted
> > in your followup, a single node config is entirely possible in pacemaker
> > (I've run that config myself). Why not provide an easy way for the user
> > to drop down to that sort of a config? I know that's kind of pushing
> > responsibility for this to the cluster stack, but that's
> > where it belongs in the first place.
> >
> > Another option might be an 'observer mode' mount, where the node
> > participates in the cluster (and the file system locking) but purely
> > in a read-only fashion.
> 
> Thinking about this some more... The only way that this works without
> potential corruptions is if we always write a periodic mmp sequence,
> even in clustered mode (which might mean each node writes to its own
> sector). That way tunefs can always check the disk for a mounted node,
> even without a cluster stack up. If tunefs sees anyone writing
> sequences to the disk, it can safely fail the operation. Tunefs also
> would have to be writing an mmp sequence once it has determined that
> the disk is not mounted. It could also write some flag alongside the
> sequence that says 'tunefs is working on this disk'. If a cluster
> mount comes up and sees a live sequence with that flag, it will know
> to fail the mount request as the disk is being modified. Local mounts
> can also use this to ensure that they are the only mounted node.
> 

The tunefs check & write sequence steps above are the MMP feature workflow.

The tunefs workflow you mentioned matches my patch [4/4] idea, and that
patch does all the work in the kernel's ocfs2_find_slot().

Because sequences should be saved in the SB, updating them requires grabbing
->osb_lock, which influences performance.

And to save space, we won't allocate a different disk block for each node.
If multiple nodes share the same disk block (eg, keep using slot_map for
saving seqs), the updating job will create an IO performance issue.

To fix the performance issue, in my view, we should disable MMP sequence
updating when the mounting mode is clustered. So I made a rule:

** If the last mount didn't do an unmount (eg: crash), the next mount MUST be the same mount type. **

Another way to state the rule:
a newly arriving node's mount type must be the same as the existing mount type.

And there is a basic fact:
the current ocfs2 code under a cluster stack already has the ability to
prevent multiple mounts when the mount type is clustered.

(from patch 4/4) there are mount labels:
+#define OCFS2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define OCFS2_MMP_SEQ_FSCK  0xE24D4D50U /* mmp_seq value when being fscked */
+#define OCFS2_MMP_SEQ_MAX   0xE24D4D4FU /* maximum valid mmp_seq value */
+#define OCFS2_MMP_SEQ_INIT  0x0         /* mmp_seq init value */
+#define OCFS2_VALID_CLUSTER   0xE24D4D55U /* value for clustered mount
+                                              under MMP disabled */
+#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount
+                                              under MMP disabled */

Whenever a mount succeeds, there should be three types of living labels
in the slotmap area:
- OCFS2_MMP_SEQ_CLEAN, OCFS2_VALID_NOCLUSTER - for local/non-clustered mount
- OCFS2_VALID_CLUSTER - for clustered mount

A newly arriving node will check whether any slot contains living labels.

Whenever an unmount succeeds, there should be two types of left-over labels
in the slotmap area:
- OCFS2_MMP_SEQ_CLEAN or 0 (zero)

When a node does an unmount, according to the mount type, it will clear
(zero) the slot or write OCFS2_MMP_SEQ_CLEAN into it.

> As it turns out, we already do pretty much all of the sequence writing
> already for the o2cb cluster stack - check out cluster/heartbeat.c.
> If memory serves, tunefs.ocfs2 has code to write to this heartbeat
> area as well. For o2cb, we use the disk heartbeat to detect node
> liveness, and to kill our local node if we see disk timeouts. For
> pcmk, we shouldn't take any of these actions as it is none of our
> responsibility. Under pcmk, the heartbeating would be purely for mount
> protection checks.

In my thinking, under a cluster stack there is already enough protection
logic.

For local/non-clustered mount, MMP gives the ability to detect node
liveness, so I only enable the kmmpd-<dev> kthread for local/non-clustered
mounts.

> 
> The downside to this is that all nodes would be heartbeating to the
> disk on a regular interval, not just one. To be fair, this is exactly
> how o2cb works and with the correct timeout choices, we were able to
> avoid a measurable performance impact, though in any case this might
> have to be a small price the user pays for cluster aware mount
> protection.
> 

I already considered this performance issue in patch [4/4].

> Let me know what you think.
> 
> Thanks,
>   --Mark

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown Heming Zhao via Ocfs2-devel
@ 2022-08-08  6:51   ` Joseph Qi via Ocfs2-devel
  2022-08-08 12:09     ` Heming Zhao via Ocfs2-devel
  0 siblings, 1 reply; 23+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-08-08  6:51 UTC (permalink / raw)
  To: Heming Zhao, ocfs2-devel



On 7/30/22 9:14 AM, Heming Zhao wrote:
> On local mount mode, there is no dlm resource initialized. If
> ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling
> flow will call ocfs2_dlm_shutdown(), then does dlm resource
> cleanup job, which will trigger kernel crash.
> 
> Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before
> return error")

The Fixes: tag should be put on a single line.

> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> ---
>  fs/ocfs2/dlmglue.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 801e60bab955..1438ac14940b 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
>  void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
>  			int hangup_pending)
>  {
> +	if (ocfs2_mount_local(osb))
> +		return;
> +

IMO, we still have to do part of ocfs2_dlm_shutdown()'s job, such as
ocfs2_lock_res_free(), which removes the lockres from d_lockres_tracking
added by ocfs2_xxx_lock_res_init().

Before commit 0737e01de9c4, it seems this issue also existed, since
osb->cconn is already set under local mount mode.

Thanks,
Joseph


* Re: [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
  2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature Heming Zhao via Ocfs2-devel
  2022-07-31  9:13   ` heming.zhao--- via Ocfs2-devel
@ 2022-08-08  8:19   ` Joseph Qi via Ocfs2-devel
  2022-08-08  9:07     ` Heming Zhao via Ocfs2-devel
  1 sibling, 1 reply; 23+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-08-08  8:19 UTC (permalink / raw)
  To: Heming Zhao, ocfs2-devel



On 7/30/22 9:14 AM, Heming Zhao wrote:
> MMP (multiple mount protection) gives filesystem ability to prevent
> from being mounted multiple times.
> 
> For avoiding data corruption when non-clustered and/or clustered mount
> are happening at same time, this commit introduced MMP feature. MMP
> idea is from ext4 MMP (fs/ext4/mmp.c) code. For ocfs2 is a clustered
> fs and also for compatible with existing slotmap feature, I did some
> optimization and modification when porting from ext4 to ocfs2.
> 
> For optimization:
> mmp has a kthread kmmpd-<dev>, which is only created in non-clustered
> mode.
> 
> We set a rule:
> If last mount didn't do unmount, (eg: crash), the next mount MUST be
> same mount type.
> 
> At last, this commit also fix commit c80af0c250c8 ("Revert "ocfs2:
> mount shared volume without ha stack") mentioned issue.

I suggest we re-split this series (especially patches 3 and 4), rather than
reviving a buggy commit first and then fixing it with another commit.

Thanks,
Joseph


* Re: [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
  2022-08-08  8:19   ` Joseph Qi via Ocfs2-devel
@ 2022-08-08  9:07     ` Heming Zhao via Ocfs2-devel
  2022-08-08  9:26       ` Heming Zhao via Ocfs2-devel
  2022-08-08  9:29       ` Joseph Qi via Ocfs2-devel
  0 siblings, 2 replies; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-08-08  9:07 UTC (permalink / raw)
  To: joseph.qi, ocfs2-devel

On Mon, Aug 08, 2022 at 04:19:22PM +0800, Joseph Qi wrote:
> 
> 
> On 7/30/22 9:14 AM, Heming Zhao wrote:
> > MMP (multiple mount protection) gives filesystem ability to prevent
> > from being mounted multiple times.
> > 
> > For avoiding data corruption when non-clustered and/or clustered mount
> > are happening at same time, this commit introduced MMP feature. MMP
> > idea is from ext4 MMP (fs/ext4/mmp.c) code. For ocfs2 is a clustered
> > fs and also for compatible with existing slotmap feature, I did some
> > optimization and modification when porting from ext4 to ocfs2.
> > 
> > For optimization:
> > mmp has a kthread kmmpd-<dev>, which is only created in non-clustered
> > mode.
> > 
> > We set a rule:
> > If last mount didn't do unmount, (eg: crash), the next mount MUST be
> > same mount type.
> > 
> > At last, this commit also fix commit c80af0c250c8 ("Revert "ocfs2:
> > mount shared volume without ha stack") mentioned issue.
> 
> I suggest we re-split this series (especially patch 3 and 4), but not
> revive a buggy commit first and then another commit fixing it BTW.
> 

No problem. I will remember this fixing style and won't make this kind of
mistake again.

Do we need to wait for Mark's further comments?
It looks like Mark didn't like non-clustered mount in his first reply, but he
showed another idea for non-clustered mount in his second reply, and I am
not sure what his meaning/attitude toward non-clustered mount is.

Thanks,
Heming


* Re: [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
  2022-08-08  9:07     ` Heming Zhao via Ocfs2-devel
@ 2022-08-08  9:26       ` Heming Zhao via Ocfs2-devel
  2022-08-08  9:29       ` Joseph Qi via Ocfs2-devel
  1 sibling, 0 replies; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-08-08  9:26 UTC (permalink / raw)
  To: joseph.qi, junxiao.bi, mark; +Cc: ocfs2-devel

A very weird problem: I added Junxiao & Mark in the 'CC' field, but only the
'To' field was kept. I tested using CC to send mail to other mail addresses,
and it worked well.

On Mon, Aug 08, 2022 at 05:07:14PM +0800, Heming Zhao via Ocfs2-devel wrote:
> On Mon, Aug 08, 2022 at 04:19:22PM +0800, Joseph Qi wrote:
> > 
> > 
> > On 7/30/22 9:14 AM, Heming Zhao wrote:
> > > MMP (multiple mount protection) gives filesystem ability to prevent
> > > from being mounted multiple times.
> > > 
> > > For avoiding data corruption when non-clustered and/or clustered mount
> > > are happening at same time, this commit introduced MMP feature. MMP
> > > idea is from ext4 MMP (fs/ext4/mmp.c) code. For ocfs2 is a clustered
> > > fs and also for compatible with existing slotmap feature, I did some
> > > optimization and modification when porting from ext4 to ocfs2.
> > > 
> > > For optimization:
> > > mmp has a kthread kmmpd-<dev>, which is only created in non-clustered
> > > mode.
> > > 
> > > We set a rule:
> > > If last mount didn't do unmount, (eg: crash), the next mount MUST be
> > > same mount type.
> > > 
> > > At last, this commit also fix commit c80af0c250c8 ("Revert "ocfs2:
> > > mount shared volume without ha stack") mentioned issue.
> > 
> > I suggest we re-split this series (especially patch 3 and 4), but not
> > revive a buggy commit first and then another commit fixing it BTW.
> > 
> 
> No problem. I remembered this fixing style, and won't make this kind of
> mistake again.
> 
> Do we need to wait for Mark's further comment?
> It looks Mark didn't like non-clustered mount at his first reply, but he
> showed another idea for non-clustered mount on the second reply. And I
> am sure what's his meaning/attitude for non-clustered mount.
> 
> Thanks,
> Heming
> 

* Re: [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
  2022-08-08  9:07     ` Heming Zhao via Ocfs2-devel
  2022-08-08  9:26       ` Heming Zhao via Ocfs2-devel
@ 2022-08-08  9:29       ` Joseph Qi via Ocfs2-devel
  1 sibling, 0 replies; 23+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-08-08  9:29 UTC (permalink / raw)
  To: Heming Zhao, ocfs2-devel



On 8/8/22 5:07 PM, Heming Zhao wrote:
> On Mon, Aug 08, 2022 at 04:19:22PM +0800, Joseph Qi wrote:
>>
>>
>> On 7/30/22 9:14 AM, Heming Zhao wrote:
>>> MMP (multiple mount protection) gives the filesystem the ability to
>>> prevent itself from being mounted multiple times.
>>>
>>> To avoid data corruption when non-clustered and/or clustered mounts
>>> happen at the same time, this commit introduces the MMP feature. The
>>> MMP idea comes from the ext4 MMP code (fs/ext4/mmp.c). Because ocfs2
>>> is a clustered fs, and for compatibility with the existing slotmap
>>> feature, I made some optimizations and modifications when porting
>>> from ext4 to ocfs2.
>>>
>>> For optimization:
>>> mmp has a kthread kmmpd-<dev>, which is only created in non-clustered
>>> mode.
>>>
>>> We set a rule:
>>> if the last mount was not cleanly unmounted (e.g. after a crash), the
>>> next mount MUST use the same mount type.
>>>
>>> Finally, this commit also fixes the issue mentioned in commit
>>> c80af0c250c8 ("Revert "ocfs2: mount shared volume without ha stack").
>>
>> I suggest we re-split this series (especially patches 3 and 4) rather
>> than reviving a buggy commit first and then fixing it with another
>> commit.
>>
> 
> No problem. I will keep this fixing style in mind and won't make this
> kind of mistake again.
> 
> Do we need to wait for Mark's further comment?
> It looks like Mark didn't like non-clustered mount in his first reply,
> but he showed another idea for non-clustered mount in his second reply,
> and I am not sure what his meaning/attitude toward non-clustered mount
> is.
> 

Yes, let's take Mark's input first for the next version.

Thanks,
Joseph

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
  2022-08-08  6:51   ` Joseph Qi via Ocfs2-devel
@ 2022-08-08 12:09     ` Heming Zhao via Ocfs2-devel
  2022-08-10  1:31       ` Joseph Qi via Ocfs2-devel
  0 siblings, 1 reply; 23+ messages in thread
From: Heming Zhao via Ocfs2-devel @ 2022-08-08 12:09 UTC (permalink / raw)
  To: Joseph Qi; +Cc: ocfs2-devel

On Mon, Aug 08, 2022 at 02:51:12PM +0800, Joseph Qi wrote:
> 
> 
> On 7/30/22 9:14 AM, Heming Zhao wrote:
> > In local mount mode, no dlm resource is initialized. If
> > ocfs2_mount_volume() fails in ocfs2_find_slot(), the error handling
> > flow will call ocfs2_dlm_shutdown(), which then does the dlm resource
> > cleanup job and triggers a kernel crash.
> > 
> > Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before
> > return error")
> 
> Should be put on one line.
OK

> 
> > Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> > ---
> >  fs/ocfs2/dlmglue.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> > index 801e60bab955..1438ac14940b 100644
> > --- a/fs/ocfs2/dlmglue.c
> > +++ b/fs/ocfs2/dlmglue.c
> > @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
> >  void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
> >  			int hangup_pending)
> >  {
> > +	if (ocfs2_mount_local(osb))
> > +		return;
> > +
> 
> IMO, we have to do part of ocfs2_dlm_shutdown() jobs such as
> ocfs2_lock_res_free(), which will remove lockres from d_lockres_tracking
> added by ocfs2_xxx_lock_res_init().
> 

ocfs2_dlm_shutdown does the cleanup job for ocfs2_dlm_init.
This patch fixes a crash in the local mount error flow.
In local mount mode, ocfs2_dlm_init does nothing, so ocfs2_dlm_shutdown
should also do nothing.

And I checked all cases that call ocfs2_dlm_shutdown:
1. mount flow:

   ocfs2_fill_super
    + xxx =fails=> label:out_super (checked, works fine)
    |
    + ocfs2_mount_volume =fails=> label:out_debugfs (checked, works fine)
    |  |
    |  + xxx =fails=> cleanup everything before returning
    |
    + xxx =fails=> label:out_dismount
         At this point, dlm has been initialized successfully, so it is
         safe to run all of ocfs2_dlm_shutdown.

2. ocfs2_dismount_volume => 'osb->cconn' is true.
   This MUST be a case where dlm was initialized successfully; everything
   looks fine.

In the previous mail/patch "[PATCH] test error handling during mount stage",
I may have forgotten to test local mount mode, so this crash wasn't
triggered.

> Before commit 0737e01de9c4, it seems this issue also exists since
> osb->cconn is already set under local mount mode. 

Yes. The bug has existed since the local mount feature was introduced in
commit c271c5c22b0a7ca45fda15f1f4d258bca36a5b94. I will change the 'Fixes'
tag in the next version.

(Hope 'CC' list takes effect for this mail. -_-# )

Thanks,
Heming



* Re: [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
  2022-08-08 12:09     ` Heming Zhao via Ocfs2-devel
@ 2022-08-10  1:31       ` Joseph Qi via Ocfs2-devel
  2022-08-10 23:52         ` heming.zhao--- via Ocfs2-devel
  0 siblings, 1 reply; 23+ messages in thread
From: Joseph Qi via Ocfs2-devel @ 2022-08-10  1:31 UTC (permalink / raw)
  To: Heming Zhao; +Cc: ocfs2-devel



On 8/8/22 8:09 PM, Heming Zhao wrote:
> On Mon, Aug 08, 2022 at 02:51:12PM +0800, Joseph Qi wrote:
>>
>>
>> On 7/30/22 9:14 AM, Heming Zhao wrote:
>>> In local mount mode, no dlm resource is initialized. If
>>> ocfs2_mount_volume() fails in ocfs2_find_slot(), the error handling
>>> flow will call ocfs2_dlm_shutdown(), which then does the dlm resource
>>> cleanup job and triggers a kernel crash.
>>>
>>> Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before
>>> return error")
>>
>> Should be put on one line.
> OK
> 
>>
>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>> ---
>>>  fs/ocfs2/dlmglue.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>> index 801e60bab955..1438ac14940b 100644
>>> --- a/fs/ocfs2/dlmglue.c
>>> +++ b/fs/ocfs2/dlmglue.c
>>> @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
>>>  void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
>>>  			int hangup_pending)
>>>  {
>>> +	if (ocfs2_mount_local(osb))
>>> +		return;
>>> +
>>
>> IMO, we have to do part of ocfs2_dlm_shutdown() jobs such as
>> ocfs2_lock_res_free(), which will remove lockres from d_lockres_tracking
>> added by ocfs2_xxx_lock_res_init().
>>
> 
> ocfs2_dlm_shutdown does the cleanup job for ocfs2_dlm_init.
> This patch fixes a crash in the local mount error flow.
> In local mount mode, ocfs2_dlm_init does nothing, so ocfs2_dlm_shutdown
> should also do nothing.
> 
Not exactly. Even in local mount we still have to initialize lockres such
as the super lock.


> And I checked all cases that call ocfs2_dlm_shutdown:
> 1. mount flow:
> 
>    ocfs2_fill_super
>     + xxx =fails=> label:out_super (checked, works fine)
>     |
>     + ocfs2_mount_volume =fails=> label:out_debugfs (checked, works fine)
>     |  |
>     |  + xxx =fails=> cleanup everything before returning
>     |
>     + xxx =fails=> label:out_dismount
>          At this point, dlm has been initialized successfully, so it is
>          safe to run all of ocfs2_dlm_shutdown.
> 
> 2. ocfs2_dismount_volume => 'osb->cconn' is true.
>    This MUST be a case where dlm was initialized successfully; everything
>    looks fine.
> 
> In the previous mail/patch "[PATCH] test error handling during mount stage",
> I may have forgotten to test local mount mode, so this crash wasn't
> triggered.
> 
>> Before commit 0737e01de9c4, it seems this issue also exists since
>> osb->cconn is already set under local mount mode.
> 
> Yes. The bug has existed since the local mount feature was introduced in
> commit c271c5c22b0a7ca45fda15f1f4d258bca36a5b94. I will change the 'Fixes'
> tag in the next version.
> 
This patch can be a separate fix; we don't have to bind it to the
nocluster mount feature, even though you triggered it via the related
local mount.

Thanks,
Joseph

> (Hope 'CC' list takes effect for this mail. -_-# )
> 
> Thanks,
> Heming



* Re: [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
  2022-08-10  1:31       ` Joseph Qi via Ocfs2-devel
@ 2022-08-10 23:52         ` heming.zhao--- via Ocfs2-devel
  0 siblings, 0 replies; 23+ messages in thread
From: heming.zhao--- via Ocfs2-devel @ 2022-08-10 23:52 UTC (permalink / raw)
  To: Joseph Qi; +Cc: ocfs2-devel

On 8/10/22 09:31, Joseph Qi wrote:
> 
> 
> On 8/8/22 8:09 PM, Heming Zhao wrote:
>> On Mon, Aug 08, 2022 at 02:51:12PM +0800, Joseph Qi wrote:
>>>
>>>
>>> On 7/30/22 9:14 AM, Heming Zhao wrote:
>>>> In local mount mode, no dlm resource is initialized. If
>>>> ocfs2_mount_volume() fails in ocfs2_find_slot(), the error handling
>>>> flow will call ocfs2_dlm_shutdown(), which then does the dlm resource
>>>> cleanup job and triggers a kernel crash.
>>>>
>>>> Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before
>>>> return error")
>>>
>>> Should be put on one line.
>> OK
>>
>>>
>>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>>> ---
>>>>   fs/ocfs2/dlmglue.c | 3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>>> index 801e60bab955..1438ac14940b 100644
>>>> --- a/fs/ocfs2/dlmglue.c
>>>> +++ b/fs/ocfs2/dlmglue.c
>>>> @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
>>>>   void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
>>>>   			int hangup_pending)
>>>>   {
>>>> +	if (ocfs2_mount_local(osb))
>>>> +		return;
>>>> +
>>>
>>> IMO, we have to do part of ocfs2_dlm_shutdown() jobs such as
>>> ocfs2_lock_res_free(), which will remove lockres from d_lockres_tracking
>>> added by ocfs2_xxx_lock_res_init().
>>>
>>
>> ocfs2_dlm_shutdown does the cleanup job for ocfs2_dlm_init.
>> This patch fixes a crash in the local mount error flow.
>> In local mount mode, ocfs2_dlm_init does nothing, so ocfs2_dlm_shutdown
>> should also do nothing.
>>
> Not exactly. Even in local mount we still have to initialize lockres such
> as the super lock.

Thank you for your patient explanation; I understand your meaning now and
will change it in the next version.
I have been very busy these days, which makes my brain stop working.

> 
> 
>> And I checked all cases that call ocfs2_dlm_shutdown:
>> 1. mount flow:
>>
>>     ocfs2_fill_super
>>      + xxx =fails=> label:out_super (checked, works fine)
>>      |
>>      + ocfs2_mount_volume =fails=> label:out_debugfs (checked, works fine)
>>      |  |
>>      |  + xxx =fails=> cleanup everything before returning
>>      |
>>      + xxx =fails=> label:out_dismount
>>           At this point, dlm has been initialized successfully, so it is
>>           safe to run all of ocfs2_dlm_shutdown.
>>
>> 2. ocfs2_dismount_volume => 'osb->cconn' is true.
>>     This MUST be a case where dlm was initialized successfully; everything
>>     looks fine.
>>
>> In the previous mail/patch "[PATCH] test error handling during mount stage",
>> I may have forgotten to test local mount mode, so this crash wasn't
>> triggered.
>>
>>> Before commit 0737e01de9c4, it seems this issue also exists since
>>> osb->cconn is already set under local mount mode.
>>
>> Yes. The bug has existed since the local mount feature was introduced in
>> commit c271c5c22b0a7ca45fda15f1f4d258bca36a5b94. I will change the 'Fixes'
>> tag in the next version.
>>
> This patch can be a separate fix; we don't have to bind it to the
> nocluster mount feature, even though you triggered it via the related
> local mount.

OK.

> 
> Thanks,
> Joseph
> 
>> (Hope 'CC' list takes effect for this mail. -_-# )
>>
>> Thanks,
>> Heming

Thanks,
Heming



Thread overview: 23+ messages
2022-07-30  1:14 [Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support Heming Zhao via Ocfs2-devel
2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown Heming Zhao via Ocfs2-devel
2022-08-08  6:51   ` Joseph Qi via Ocfs2-devel
2022-08-08 12:09     ` Heming Zhao via Ocfs2-devel
2022-08-10  1:31       ` Joseph Qi via Ocfs2-devel
2022-08-10 23:52         ` heming.zhao--- via Ocfs2-devel
2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: add mlog ML_WARNING support Heming Zhao via Ocfs2-devel
2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack" Heming Zhao via Ocfs2-devel
2022-07-31 17:42   ` Mark Fasheh via Ocfs2-devel
2022-08-01  1:01     ` heming.zhao--- via Ocfs2-devel
2022-08-01  2:25       ` heming.zhao--- via Ocfs2-devel
2022-08-04 23:53       ` Mark Fasheh via Ocfs2-devel
2022-08-05  4:11         ` Mark Fasheh via Ocfs2-devel
2022-08-06 15:53           ` heming.zhao--- via Ocfs2-devel
2022-08-06 16:20           ` Heming Zhao via Ocfs2-devel
2022-08-06 15:44         ` heming.zhao--- via Ocfs2-devel
2022-08-06 16:15         ` Heming Zhao via Ocfs2-devel
2022-07-30  1:14 ` [Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature Heming Zhao via Ocfs2-devel
2022-07-31  9:13   ` heming.zhao--- via Ocfs2-devel
2022-08-08  8:19   ` Joseph Qi via Ocfs2-devel
2022-08-08  9:07     ` Heming Zhao via Ocfs2-devel
2022-08-08  9:26       ` Heming Zhao via Ocfs2-devel
2022-08-08  9:29       ` Joseph Qi via Ocfs2-devel
