* Sharing ext4 on target storage to multiple initiators using NVMeoF
@ 2019-09-16 14:33 Daegyu Han
  2019-09-16 19:23 ` Eric Sandeen
  2019-09-17  6:48 ` Christoph Hellwig
  0 siblings, 2 replies; 6+ messages in thread
From: Daegyu Han @ 2019-09-16 14:33 UTC (permalink / raw)
  To: linux-fsdevel

Hi linux file system experts,

I want to share an ext4 filesystem on a storage server with multiple
initiators (nodes A and B) using NVMe-oF.
Node A will write files to the ext4 filesystem on the storage server,
and Node B will mount it read-only.
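
(For context, the setup looks roughly like the following; the transport,
address, NQN, device name and mount point are just examples, not my exact
configuration:)

  # on both initiators: connect to the NVMe-oF target
  nvme connect -t tcp -a 192.168.0.10 -s 4420 -n nqn.2019-09.example:ext4-share
  # Node A: read-write mount
  mount /dev/nvme1n1 /mnt/shared
  # Node B: read-only mount of the same namespace
  mount -o ro /dev/nvme1n1 /mnt/shared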

I am doing this as a prototype test.

On Node B, I cannot see the dentry and inode of a file written by Node A
unless I remount (umount and then mount) the filesystem.

Why is that?

I expected that, even with a filesystem cache (dentries, inodes) on Node B,
disk I/O would occur to read the data written by Node A.

Curiously, if I drop the caches on Node B and then run
blockdev --flushbufs, I can access the file written by Node A.
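
(Concretely, what I run on Node B is roughly the following; the device
name and mount point are just examples:)

  echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes
  blockdev --flushbufs /dev/nvme1n1    # flush the block device's buffer cache
  ls -l /mnt/shared                    # the file written by Node A now shows up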

I checked the kernel code and found that --flushbufs ends up calling
sync_filesystem(), which flushes the superblock and all dirty
filesystem caches.

Should the superblock data structure be flushed (updated) when the
on-disk inode is accessed?

I wonder why this happens.

Regards,


* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
  2019-09-16 14:33 Sharing ext4 on target storage to multiple initiators using NVMeoF Daegyu Han
@ 2019-09-16 19:23 ` Eric Sandeen
  2019-09-17  0:44   ` Daegyu Han
  2019-09-17  6:48 ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Sandeen @ 2019-09-16 19:23 UTC (permalink / raw)
  To: Daegyu Han, linux-fsdevel



On 9/16/19 9:33 AM, Daegyu Han wrote:
> Hi linux file system experts,
> 
> I want to share an ext4 filesystem on a storage server with multiple
> initiators (nodes A and B) using NVMe-oF.
> Node A will write files to the ext4 filesystem on the storage server,
> and Node B will mount it read-only.
>
> I am doing this as a prototype test.
>
> On Node B, I cannot see the dentry and inode of a file written by Node A
> unless I remount (umount and then mount) the filesystem.
> 
> Why is that?

Caching, metadata journaling, etc.

What you are trying to do will not work.

> I expected that, even with a filesystem cache (dentries, inodes) on Node B,
> disk I/O would occur to read the data written by Node A.

why would it?  there is no coordination between the nodes.  ext4 is
not a clustered filesystem.

> Curiously, if I drop the caches on Node B and then run
> blockdev --flushbufs, I can access the file written by Node A.
>
> I checked the kernel code and found that --flushbufs ends up calling
> sync_filesystem(), which flushes the superblock and all dirty
> filesystem caches.
>
> Should the superblock data structure be flushed (updated) when the
> on-disk inode is accessed?

It has nothing to do w/ the superblock.

> I wonder why this happens.

ext4 cannot be used for what you're trying to do.

-Eric


* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
  2019-09-16 19:23 ` Eric Sandeen
@ 2019-09-17  0:44   ` Daegyu Han
  2019-09-17 12:54     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 6+ messages in thread
From: Daegyu Han @ 2019-09-17  0:44 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-fsdevel

This started out of curiosity.
I know this is not the right way to use a local filesystem, and it may
seem strange.
I just wanted to set up this situation and experiment with it.

I thought it would work if I flushed Node B's cached filesystem
metadata by dropping the caches, but it didn't.

I googled for an alternative to unmounting and remounting, and found a
Stack Overflow post suggesting that the filesystem be synced via
blockdev --flushbufs.

So I ran blockdev --flushbufs after dropping the caches.
However, I still do not understand why this lets me read the data
written to the shared storage from Node B.

Thank you,



* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
  2019-09-16 14:33 Sharing ext4 on target storage to multiple initiators using NVMeoF Daegyu Han
  2019-09-16 19:23 ` Eric Sandeen
@ 2019-09-17  6:48 ` Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2019-09-17  6:48 UTC (permalink / raw)
  To: Daegyu Han; +Cc: linux-fsdevel

You might want to look into the pnfs block layout instead to do this
safely.  It is supported with XFS out of the box, but adding ext4
support shouldn't be all that hard.
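
Very roughly, and only as an untested sketch (the export path, options and
names below are made up), that would look something like:

  # server: export an XFS filesystem and allow pNFS layouts, e.g. in /etc/exports:
  #   /export  *(rw,sync,no_subtree_check,pnfs)
  # clients: need to see the same block device (e.g. via NVMeoF) and mount
  # with NFSv4.1 or later; the block layout client also needs blkmapd running
  mount -t nfs -o vers=4.1 server:/export /mnt

Reads and writes then go to the shared device directly, while the NFS
server keeps the filesystem metadata coherent.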


* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
  2019-09-17  0:44   ` Daegyu Han
@ 2019-09-17 12:54     ` Theodore Y. Ts'o
  2019-09-17 15:38       ` Daegyu Han
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Y. Ts'o @ 2019-09-17 12:54 UTC (permalink / raw)
  To: Daegyu Han; +Cc: Eric Sandeen, linux-fsdevel

On Tue, Sep 17, 2019 at 09:44:00AM +0900, Daegyu Han wrote:
> This started out of curiosity.
> I know this is not the right way to use a local filesystem, and it may
> seem strange.
> I just wanted to set up this situation and experiment with it.
>
> I thought it would work if I flushed Node B's cached filesystem
> metadata by dropping the caches, but it didn't.
>
> I googled for an alternative to unmounting and remounting, and found a
> Stack Overflow post suggesting that the filesystem be synced via
> blockdev --flushbufs.
>
> So I ran blockdev --flushbufs after dropping the caches.
> However, I still do not understand why this lets me read the data
> written to the shared storage from Node B.

There are many problems, but the primary one is that Node B has
caches.  If it has a cached version of the inode table block, why
should it reread it after Node A has modified it?  The VFS also
has negative dentry caches.  This is very important for search path
performance.  Consider, for example, a compiler that may need to look
in many directories for a particular header file.  If the C program has:

#include "amazing.h"

The C compiler may need to look in a dozen or more directories trying
to find the header file amazing.h.  And each successive C compiler
process will need to keep looking in all of those same directories.
So the kernel will keep a "negative cache", so if
/usr/include/amazing.h doesn't exist, it won't ask the file system
when the 2nd, 3rd, 4th, 5th, ... compiler process tries to open
/usr/include/amazing.h.
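
Just as a rough illustration from the shell (the paths are examples only):

  # look up a header that doesn't exist a few times
  for i in 1 2 3; do stat /usr/include/amazing.h; done
  # the failed lookups are remembered as negative dentries; recent kernels
  # report a count of negative dentries in the dentry-cache statistics
  cat /proc/sys/fs/dentry-state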

You can disable all of the caches, but that makes the file system
terribly, terribly slow.  Network file systems have schemes whereby
they can safely cache, since the network file system protocol has a
way for the client to be told that its cached information must be
reread.  Local disk file systems don't have anything like this.

There are shared-disk file systems that are designed for
multi-initiator setups.  Examples of this include gfs and ocfs2 in
Linux.  You will find that they often trade performance for
scalability to support multiple initiators.

You can use ext4 for fallback schemes, where the primary server has
exclusive access to the disk, and when the primary dies, the fallback
server can take over.  The ext4 multi-mount protection scheme is
designed for those sorts of use cases, and it's used by Lustre
servers.  But only one system is actively reading or writing to the
disk at a time, and the fallback server has to replay the journal and
ensure that the primary server won't "come back to life".  Those are
sometimes called STONITH schemes ("shoot the other node in the head"),
and might involve network-controlled power strips, etc.
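
(For reference, multi-mount protection is enabled with e2fsprogs roughly
like this; the device name is just an example:)

  mkfs.ext4 -O mmp /dev/sdX                  # enable MMP at mkfs time
  tune2fs -O mmp /dev/sdX                    # or add it to an existing, unmounted filesystem
  tune2fs -E mmp_update_interval=5 /dev/sdX  # optionally tune the MMP update interval (seconds)
  dumpe2fs -h /dev/sdX | grep -i mmp         # check the MMP block number and interval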

Regards,

						- Ted


* Re: Sharing ext4 on target storage to multiple initiators using NVMeoF
  2019-09-17 12:54     ` Theodore Y. Ts'o
@ 2019-09-17 15:38       ` Daegyu Han
  0 siblings, 0 replies; 6+ messages in thread
From: Daegyu Han @ 2019-09-17 15:38 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: linux-fsdevel

Thank you for the clear explanation.

Best regards,

Daegyu


