* [Ocfs2-devel] Mixed mounts w/ different physical block sizes (long post)
       [not found] ` <9e885fc1-5701-c985-f63f-b90767253328@rentapacs.de>
@ 2017-09-18 15:43   ` Michael Ulbrich
  2017-09-19  3:32     ` Changwei Ge
  0 siblings, 1 reply; 3+ messages in thread
From: Michael Ulbrich @ 2017-09-18 15:43 UTC (permalink / raw)
  To: ocfs2-devel

Hi again,

Chatting with a helpful person on the #ocfs2 IRC channel this morning, I
was encouraged to cross-post to ocfs2-devel. For background and further
details please see my two previous posts to ocfs2-users from last week,
which have gone unanswered so far.

Based on my current state of investigation I have changed the subject from

"Node 8 doesn't mount / Wrong slot map assignment" to the current "Mixed
mounts ..."

Here we go:

I've learned that an increasing number of large hard disks come
formatted with a 4k physical block size.

Now I've created an ocfs2 shared file system on top of drbd on a RAID1
of two 6 TB disks with such a 4k physical block size. File system
creation was done on a hypervisor which actually saw the device as
having a 4k physical sector size.

I'm using the default o2cb cluster stack. Version is ocfs2 1.6.4 on
stock Debian 8.

A node (numbered "1" in cluster.conf) that mounts this device with 4k
physical blocks produces a strange "times 8" numbering when checking
heartbeat debug info with 'echo "hb" | debugfs.ocfs2 -n /dev/drbd1':

hb
        node: node              seq       generation checksum
           8:    1 0000000059bfd253 00bfa1b63f30e494 c518c55a

I'm not sure why the first two columns are named "node:" and "node", but
I assume the first "node:" is an index into some internal data structure
(slot map? heartbeat region?) while the second "node" column shows the
actual node number as given in cluster.conf.
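
If that assumption holds, the first "node:" column might simply be the
node number scaled by the ratio of the heartbeat block size to the
512-byte sectors which the tool appears to assume - my reading only,
unverified:

  $ echo $(( 1 * 4096 / 512 ))   # node 1, 4k heartbeat blocks
  8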

Now a second node mounts the shared file system, again as a 4k block
device:

hb
        node: node              seq       generation checksum
           8:    1 0000000059bfd36a 00bfa1b63f30e494 d4f79d63
          16:    2 0000000059bfd369 7acf8521da342228 4b8cd74d

As it actually happened in my setup, a cluster of 2 hypervisors with 3
virtual machines on top of each (8 ocfs2 nodes in total), mounting the
fs on the first virtual machine, node number 3, gives:

hb
        node: node              seq       generation checksum
           3:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
           8:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
          16:    2 0000000059bfd413 7acf8521da342228 cd48c018

Uhm, ... wait ... 3 ??

Mounting on further VMs (nodes 4, 5, 6 and 7) leads to:

hb
        node: node              seq       generation checksum
           3:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
           4:    4 0000000059bfd413 debf95d5ff50dc10 3839c791
           5:    5 0000000059bfd414 529a98c758325d5b 60080c42
           6:    6 0000000059bfd412 14acfb487fa8c8b8 f54cef9d
           7:    7 0000000059bfd413 4d2d36de0b0d6b2e 3f1ad275
           8:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
          16:    2 0000000059bfd413 7acf8521da342228 cd48c018

Up to this point I did not notice any error or warning in the machines'
console or kernel logs.

And then, when trying to mount on node 8, there finally is an error:

kern.log node 1:

(o2hb-0AEE381A14,50990,4):o2hb_check_own_slot:582 ERROR: Another node is
heartbeating on device (drbd1): expected(1:0x18acf7b0b3e5544c,
0x59b8445c), ondisk(8:0xb91302db72a65364, 0x59b8445b)

kern.log node 8:

ocfs2: Mounting device (254,16) on (node 8, slot 7) with ordered data mode.
(o2hb-0AEE381A14,518,1):o2hb_check_own_slot:582 ERROR: Another node is
heartbeating on device (vdc): expected(8:0x18acf7b0b3e5544c,
0x59b8445c), ondisk(1:0x18acf7b0b3e5544c, 0x59b8445c)

(the actual seq and generation values here are not from the hb debug
dump above)

Now we have a conflict on slot 8.

When I encountered this error for the first time, I didn't know about
heartbeat debug info, slot maps or heartbeat regions and had no idea
what might have gone wrong, so I started experimenting and found a
"solution" by swapping nodes 1 <-> 8 in cluster.conf. This leads to the
following layout of the heartbeat region (?):

hb
        node: node              seq       generation checksum
           1:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
           3:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
           4:    4 0000000059bfd413 debf95d5ff50dc10 3839c791
           5:    5 0000000059bfd414 529a98c758325d5b 60080c42
           6:    6 0000000059bfd412 14acfb487fa8c8b8 f54cef9d
           7:    7 0000000059bfd413 4d2d36de0b0d6b2e 3f1ad275
          16:    2 0000000059bfd413 7acf8521da342228 cd48c018
          64:    8 0000000059bfd413 73a63eb550a33095 f4e074d1

Voila - all 8 nodes mounted, problem solved - let's continue with
getting this cluster ready for production ...

As it turned out, this was in no way a stable configuration: after a few
weeks, spurious reboots (peer fencing) started to happen, with drbd
losing its replication connection and all kinds of weird kernel oopses
and panics from drbd and ocfs2. The reboots were usually preceded by
bursts of errors like:

Sep 11 00:01:27 web1 kernel: [ 9697.644436]
(o2hb-10254DCA50,515,1):o2hb_check_own_slot:582 ERROR: Heartbeat
sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
0x59b5b635), ondisk(3:0x743493e99d19e721, 0x59b5b633)
Sep 11 00:03:43 web1 kernel: [ 9833.918668]
(o2hb-10254DCA50,515,1):o2hb_check_own_slot:582 ERROR: Heartbeat
sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
0x59b5b6bd), ondisk(3:0x743493e99d19e721, 0x59b5b6bb)
Sep 11 00:03:45 web1 kernel: [ 9835.920551]
(o2hb-10254DCA50,515,1):o2hb_check_own_slot:582 ERROR: Heartbeat
sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
0x59b5b6bf), ondisk(3:0x743493e99d19e721, 0x59b5b6bb)
Sep 11 00:09:10 web1 kernel: [10160.576453]
(o2hb-10254DCA50,515,0):o2hb_check_own_slot:582 ERROR: Heartbeat
sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
0x59b5b804), ondisk(3:0x743493e99d19e721, 0x59b5b802)

In the end the ocfs2 filesystem had to be rebuilt to get rid of the
errors. Things went OK for a while before the same symptoms of fs
corruption came back.

To make a long story short: we found out that the virtual machines did
not see the disk device as having 4k sectors but as having standard
512-byte blocks. So we had what I've coined a "mixed mount" of the same
ocfs2 file system: 2 nodes mounted with a 4k physical block size, the
other 6 nodes mounted with a 512-byte block size.
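
If I read the hb dumps right, this also explains the slot 8 conflict
above: each node appears to heartbeat at the byte offset node number *
block size as seen by that node (my interpretation, unverified), so node
8 on a 512-byte view and node 1 on a 4k view land on the very same
on-disk block:

  $ echo $(( 8 * 512 ))    # node 8, 512-byte view of the device
  4096
  $ echo $(( 1 * 4096 ))   # node 1, 4k view of the device
  4096

It would also explain why swapping 1 <-> 8 made the immediate conflict
disappear: the resulting indexes (1, 3, 4, 5, 6, 7, 16, 64) no longer
overlap, while the underlying 512/4k mismatch remained.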

Configuring the VMs with:

<blockio logical_block_size='4096' physical_block_size='4096'/>

leads to a heartbeat slot map:

hb
        node: node              seq       generation checksum
           8:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
          16:    2 0000000059bfd413 7acf8521da342228 cd48c018
          24:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
          32:    4 0000000059bfd413 debf95d5ff50dc10 3839c791
          40:    5 0000000059bfd414 529a98c758325d5b 60080c42
          48:    6 0000000059bfd412 14acfb487fa8c8b8 f54cef9d
          56:    7 0000000059bfd413 4d2d36de0b0d6b2e 3f1ad275
          64:    8 0000000059bfd413 73a63eb550a33095 f4e074d1

Operation has been stable so far, with no 'Heartbeat sequence mismatch'
errors. The "times 8" values in the "node:" column are still strange,
but this may be a purely cosmetic issue.
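
For completeness, the blockio line above goes on each VM's disk
definition in the libvirt domain XML. A minimal sketch; the driver line
and cache mode are assumptions from a typical shared-storage setup,
while source and target match the devices in my logs:

  <disk type='block' device='disk'>
    <driver name='qemu' type='raw' cache='none'/>
    <source dev='/dev/drbd1'/>
    <blockio logical_block_size='4096' physical_block_size='4096'/>
    <target dev='vdc' bus='virtio'/>
  </disk>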

Browsing the code of heartbeat.c, I'm not sure whether such a "mixed
mount" is *supposed* to work and we just triggered a minor bug that can
easily be fixed - or whether such a scenario is a definite no-no that
should be strictly avoided. In the latter case an error message and
cancellation of the inappropriate mount operation would be very helpful.

Anyway, I would greatly appreciate a knowledgeable opinion from the
members of the ocfs2-devel list on this topic - any takers?

Thanks in advance + Best regards ... Michael


* [Ocfs2-devel] Mixed mounts w/ different physical block sizes (long post)
  2017-09-18 15:43   ` [Ocfs2-devel] Mixed mounts w/ different physical block sizes (long post) Michael Ulbrich
@ 2017-09-19  3:32     ` Changwei Ge
       [not found]       ` <7895827c-69c0-c183-4465-8eaa7808cafe@rentapacs.de>
  0 siblings, 1 reply; 3+ messages in thread
From: Changwei Ge @ 2017-09-19  3:32 UTC (permalink / raw)
  To: ocfs2-devel

Hi Michael,

On 2017/9/18 23:45, Michael Ulbrich wrote:
> Hi again,
> 
> Chatting with a helpful person on the #ocfs2 IRC channel this morning, I
> was encouraged to cross-post to ocfs2-devel. For background and further
> details please see my two previous posts to ocfs2-users from last week,
> which have gone unanswered so far.
> 
> Based on my current state of investigation I have changed the subject from
> 
> "Node 8 doesn't mount / Wrong slot map assignment" to the current "Mixed
> mounts ..."
> 
> Here we go:
> 
> I've learned that an increasing number of large hard disks come
> formatted with a 4k physical block size.
> 
> Now I've created an ocfs2 shared file system on top of drbd on a RAID1
> of two 6 TB disks with such a 4k physical block size. File system
> creation was done on a hypervisor which actually saw the device as
> having a 4k physical sector size.
> 
> I'm using the default o2cb cluster stack. Version is ocfs2 1.6.4 on
> stock Debian 8.
> 
> A node (numbered "1" in cluster.conf) that mounts this device with 4k
> physical blocks produces a strange "times 8" numbering when checking
> heartbeat debug info with 'echo "hb" | debugfs.ocfs2 -n /dev/drbd1':
> 
> hb
>          node: node              seq       generation checksum
>             8:    1 0000000059bfd253 00bfa1b63f30e494 c518c55a
> 
> I'm not sure why the first two columns are named "node:" and "node", but
> I assume the first "node:" is an index into some internal data structure
> (slot map? heartbeat region?) while the second "node" column shows the
> actual node number as given in cluster.conf.
> 
> Now a second node mounts the shared file system, again as a 4k block
> device:
> 
> hb
>          node: node              seq       generation checksum
>             8:    1 0000000059bfd36a 00bfa1b63f30e494 d4f79d63
>            16:    2 0000000059bfd369 7acf8521da342228 4b8cd74d
> 
> As it actually happened in my setup, a cluster of 2 hypervisors with 3
> virtual machines on top of each (8 ocfs2 nodes in total), mounting the
> fs on the first virtual machine, node number 3, gives:
> 
> hb
>          node: node              seq       generation checksum
>             3:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
>             8:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
>            16:    2 0000000059bfd413 7acf8521da342228 cd48c018
> 
> Uhm, ... wait ... 3 ??
> 
> Mounting on further VMs (nodes 4, 5, 6 and 7) leads to:
> 
> hb
>          node: node              seq       generation checksum
>             3:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
>             4:    4 0000000059bfd413 debf95d5ff50dc10 3839c791
>             5:    5 0000000059bfd414 529a98c758325d5b 60080c42
>             6:    6 0000000059bfd412 14acfb487fa8c8b8 f54cef9d
>             7:    7 0000000059bfd413 4d2d36de0b0d6b2e 3f1ad275
>             8:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
>            16:    2 0000000059bfd413 7acf8521da342228 cd48c018
> 
> Up to this point I did not notice any error or warning in the machines'
> console or kernel logs.
> 
> And then, when trying to mount on node 8, there finally is an error:
> 
> kern.log node 1:
> 
> (o2hb-0AEE381A14,50990,4):o2hb_check_own_slot:582 ERROR: Another node is
> heartbeating on device (drbd1): expected(1:0x18acf7b0b3e5544c,
> 0x59b8445c), ondisk(8:0xb91302db72a65364, 0x59b8445b)
> 
> kern.log node 8:
> 
> ocfs2: Mounting device (254,16) on (node 8, slot 7) with ordered data mode.
> (o2hb-0AEE381A14,518,1):o2hb_check_own_slot:582 ERROR: Another node is
> heartbeating on device (vdc): expected(8:0x18acf7b0b3e5544c,
> 0x59b8445c), ondisk(1:0x18acf7b0b3e5544c, 0x59b8445c)
> 
> (the actual seq and generation values here are not from the hb debug
> dump above)
> 
> Now we have a conflict on slot 8.
> 
> When I encountered this error for the first time, I didn't know about
> heartbeat debug info, slot maps or heartbeat regions and had no idea
> what might have gone wrong, so I started experimenting and found a
> "solution" by swapping nodes 1 <-> 8 in cluster.conf. This leads to the
> following layout of the heartbeat region (?):
> 
> hb
>          node: node              seq       generation checksum
>             1:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
>             3:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
>             4:    4 0000000059bfd413 debf95d5ff50dc10 3839c791
>             5:    5 0000000059bfd414 529a98c758325d5b 60080c42
>             6:    6 0000000059bfd412 14acfb487fa8c8b8 f54cef9d
>             7:    7 0000000059bfd413 4d2d36de0b0d6b2e 3f1ad275
>            16:    2 0000000059bfd413 7acf8521da342228 cd48c018
>            64:    8 0000000059bfd413 73a63eb550a33095 f4e074d1
> 
> Voila - all 8 nodes mounted, problem solved - let's continue with
> getting this cluster ready for production ...
> 
> As it turned out, this was in no way a stable configuration: after a few
> weeks, spurious reboots (peer fencing) started to happen, with drbd
> losing its replication connection and all kinds of weird kernel oopses
> and panics from drbd and ocfs2. The reboots were usually preceded by
> bursts of errors like:
> 
> Sep 11 00:01:27 web1 kernel: [ 9697.644436]
> (o2hb-10254DCA50,515,1):o2hb_check_own_slot:582 ERROR: Heartbeat
> sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
> 0x59b5b635), ondisk(3:0x743493e99d19e721, 0x59b5b633)
> Sep 11 00:03:43 web1 kernel: [ 9833.918668]
> (o2hb-10254DCA50,515,1):o2hb_check_own_slot:582 ERROR: Heartbeat
> sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
> 0x59b5b6bd), ondisk(3:0x743493e99d19e721, 0x59b5b6bb)
> Sep 11 00:03:45 web1 kernel: [ 9835.920551]
> (o2hb-10254DCA50,515,1):o2hb_check_own_slot:582 ERROR: Heartbeat
> sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
> 0x59b5b6bf), ondisk(3:0x743493e99d19e721, 0x59b5b6bb)
> Sep 11 00:09:10 web1 kernel: [10160.576453]
> (o2hb-10254DCA50,515,0):o2hb_check_own_slot:582 ERROR: Heartbeat
> sequence mismatch on device (vdc): expected(3:0x743493e99d19e721,
> 0x59b5b804), ondisk(3:0x743493e99d19e721, 0x59b5b802)
> 
> In the end the ocfs2 filesystem had to be rebuilt to get rid of the
> errors. Things went OK for a while before the same symptoms of fs
> corruption came back.
> 
> To make a long story short: we found out that the virtual machines did
> not see the disk device as having 4k sectors but as having standard
> 512-byte blocks. So we had what I've coined a "mixed mount" of the same
> ocfs2 file system: 2 nodes mounted with a 4k physical block size, the
> other 6 nodes mounted with a 512-byte block size.
> 
> Configuring the VMs with:
> 
> <blockio logical_block_size='4096' physical_block_size='4096'/>
> 
> leads to a heartbeat slot map:
> 
> hb
>          node: node              seq       generation checksum
>             8:    1 0000000059bfd412 00bfa1b63f30e494 e782d86e
>            16:    2 0000000059bfd413 7acf8521da342228 cd48c018
>            24:    3 0000000059bfd413 59eb77b4db07884b 87a5057d
>            32:    4 0000000059bfd413 debf95d5ff50dc10 3839c791
>            40:    5 0000000059bfd414 529a98c758325d5b 60080c42
>            48:    6 0000000059bfd412 14acfb487fa8c8b8 f54cef9d
>            56:    7 0000000059bfd413 4d2d36de0b0d6b2e 3f1ad275
>            64:    8 0000000059bfd413 73a63eb550a33095 f4e074d1
Could you please also provide information about the *slot_map*? Just
type "slotmap" in the debugfs.ocfs2 tool. This will be helpful in
analyzing your case.

Please also paste the output generated by:
cat /sys/kernel/config/cluster/<your cluster name>/heartbeat/<file system
UUID>
so we can see how your cluster is configured.
Files like block_bytes, blocks and start_block are preferred.
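
Something like this sketch should collect them in one go (substitute the
placeholders for your actual cluster name and file system UUID):

  cd /sys/kernel/config/cluster/<your cluster name>/heartbeat/<fs UUID>
  for f in block_bytes blocks start_block; do
          printf '%s: ' "$f"; cat "$f"
  done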


> 
> Operation has been stable so far, with no 'Heartbeat sequence mismatch'
> errors. The "times 8" values in the "node:" column are still strange,
> but this may be a purely cosmetic issue.
I suppose this is because debugfs.ocfs2 *assumes* that block devices
are all formatted with 512-byte blocks.
Perhaps we can improve this.

> 
> Browsing the code of heartbeat.c, I'm not sure whether such a "mixed
> mount" is *supposed* to work and we just triggered a minor bug that can
> easily be fixed - or whether such a scenario is a definite no-no that
> should be strictly avoided. In the latter case an error message and
> cancellation of the inappropriate mount operation would be very helpful.
> 
> Anyway, I would greatly appreciate a knowledgeable opinion from the
> members of the ocfs2-devel list on this topic - any takers?
> 
> Thanks in advance + Best regards ... Michael
> 


* [Ocfs2-devel] Mixed mounts w/ different physical block sizes (long post)
       [not found]       ` <7895827c-69c0-c183-4465-8eaa7808cafe@rentapacs.de>
@ 2017-09-20  5:19         ` Changwei Ge
  0 siblings, 0 replies; 3+ messages in thread
From: Changwei Ge @ 2017-09-20  5:19 UTC (permalink / raw)
  To: ocfs2-devel

On 2017/9/19 14:47, Michael Ulbrich wrote:
> Hi Changwei,
> 
> thanks for looking into this!
> 
> On 19/09/17 05:32, Changwei Ge wrote:
> 
>> Could you please also provide information about the *slot_map*? Just
>> type "slotmap" in the debugfs.ocfs2 tool. This will be helpful in
>> analyzing your case.
>>
>> Please also paste the output generated by:
>> cat /sys/kernel/config/cluster/<your cluster name>/heartbeat/<file system
>> UUID>
>> so we can see how your cluster is configured.
>> Files like block_bytes, blocks and start_block are preferred.
> 
> Ok, here we go. The 4k fs is currently mounted on 6 nodes:
> 
> hb
>          node: node              seq       generation checksum
>             8:    1 0000000059c0b87d 4b54662f8a10a4c6 ce05089c
>            16:    2 0000000059c0b87d 2e78b067074950f9 057c9608
>            24:    3 0000000059c0b87c 46f4e173012a4b7b a2073bec
>            40:    5 0000000059c0b87b c0a8e3023e9edaa6 bbca2048
>            48:    6 0000000059c0b87c 304d8bc8e22383a2 2f6002e6
>            64:    8 0000000059c0b87c 8d9c95c4b0296c70 f5e8d50a
> 
> And the associated slot map:
> 
> slotmap
> 	Slot#   Node#
> 	    0       1
> 	    1       2
> 	    2       3
> 	    3       6
> 	    4       5
> 	    5       8
> 
> Info from /sys/kernel/config/cluster/ocfs1_cluster/heartbeat/<UUID>
Hmm.
You mentioned that heartbeat sequence mismatches showed up when the 8th
node mounted the file system, but I can only see 6 nodes here.
Is this the scenario in which your problem occurred?


> 
> block_bytes: 4096
> blocks:       255
> start_block:  273

> 
>>> Operation has been stable so far, with no 'Heartbeat sequence mismatch'
>>> errors. The "times 8" values in the "node:" column are still strange,
>>> but this may be a purely cosmetic issue.
>> I suppose this is because debugfs.ocfs2 *assumes* that block devices
>> are all formatted with 512-byte blocks.
>> Perhaps we can improve this.
> 
> Yep, that would be great!
> 
> And what about this scenario of mixed mounts, partly from nodes
> accessing the device based on 512-byte sectors and partly from nodes
> seeing it as a 4k device: should this be avoided, or is it supposed to
> work because ocfs2 internally maps the differently sized sectors to a
> common structure in the heartbeat region?
> 
How can this happen?
Do you mean that a single physical disk device, accessed by different
nodes, shows different block sizes (512 bytes and 4k)?

For example, node A accesses the disk with a block size of 512 bytes,
while node B accesses the *same* disk with a block size of 4k.
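
If so, each node can check what it actually sees via the standard
block-layer sysfs attributes (device name taken from your logs):

  cat /sys/block/vdc/queue/logical_block_size
  cat /sys/block/vdc/queue/physical_block_size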

Thanks,
Changwei.
> Thanks again ... Michael
> 

