Re: Please add the zuf tree to linux-next

* Re: Please add the zuf tree to linux-next
       [not found] <1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com>
@ 2019-10-24  2:36 ` Christoph Hellwig
  2019-10-29  5:07   ` Stephen Rothwell
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2019-10-24  2:36 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Stephen Rothwell, linux-fsdevel, Miklos Szeredi, Alexander Viro,
	linux-kernel

On Thu, Oct 24, 2019 at 03:34:29AM +0300, Boaz Harrosh wrote:
> Hello Stephen
> 
> Please add the zuf tree below to the linux-next tree.
> 	[https://github.com/NetApp/zufs-zuf zuf]

I don't remember us coming to the conclusion that this actually is
useful and doesn't just badly duplicate the fuse functionality.

* Re: Please add the zuf tree to linux-next
  2019-10-24  2:36 ` Please add the zuf tree to linux-next Christoph Hellwig
@ 2019-10-29  5:07   ` Stephen Rothwell
  2019-10-29  5:53     ` Christoph Hellwig
  2019-11-14 14:02     ` Boaz Harrosh
  0 siblings, 2 replies; 8+ messages in thread
From: Stephen Rothwell @ 2019-10-29  5:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Boaz Harrosh, linux-fsdevel, Miklos Szeredi, Alexander Viro,
	linux-kernel

Hi Christoph,

On Wed, 23 Oct 2019 19:36:06 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Oct 24, 2019 at 03:34:29AM +0300, Boaz Harrosh wrote:
> > Hello Stephen
> > 
> > Please add the zuf tree below to the linux-next tree.
> > 	[https://github.com/NetApp/zufs-zuf zuf]  
> 
> I don't remember us coming to the conclusion that this actually is
> useful and doesn't just badly duplicate the fuse functionality.

So is that a hard Nak on inclusion in linux-next at this time?

-- 
Cheers,
Stephen Rothwell

* Re: Please add the zuf tree to linux-next
  2019-10-29  5:07   ` Stephen Rothwell
@ 2019-10-29  5:53     ` Christoph Hellwig
  2019-11-14 14:02     ` Boaz Harrosh
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2019-10-29  5:53 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Christoph Hellwig, Boaz Harrosh, linux-fsdevel, Miklos Szeredi,
	Alexander Viro, linux-kernel

On Tue, Oct 29, 2019 at 04:07:33PM +1100, Stephen Rothwell wrote:
> > > Please add the zuf tree below to the linux-next tree.
> > > 	[https://github.com/NetApp/zufs-zuf zuf]  
> > 
> > I don't remember us coming to the conclusion that this actually is
> > useful and doesn't just badly duplicate the fuse functionality.
> 
> So is that a hard Nak on inclusion in linux-next at this time?

As far as I'm concerned yes.  In the end we'll need to find rough
consensus as I'm not the only one to decide, though.

* Re: Please add the zuf tree to linux-next
  2019-10-29  5:07   ` Stephen Rothwell
  2019-10-29  5:53     ` Christoph Hellwig
@ 2019-11-14 14:02     ` Boaz Harrosh
  2019-11-14 14:56       ` Miklos Szeredi
  1 sibling, 1 reply; 8+ messages in thread
From: Boaz Harrosh @ 2019-11-14 14:02 UTC (permalink / raw)
  To: Stephen Rothwell, Christoph Hellwig, Miklos Szeredi,
	Linus Torvalds, Dave Chinner
  Cc: Boaz Harrosh, linux-fsdevel, Alexander Viro, linux-kernel

On 29/10/2019 07:07, Stephen Rothwell wrote:
> Hi Christoph,
> 
> On Wed, 23 Oct 2019 19:36:06 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Thu, Oct 24, 2019 at 03:34:29AM +0300, Boaz Harrosh wrote:
>>> Hello Stephen
>>>
>>> Please add the zuf tree below to the linux-next tree.
>>> 	[https://github.com/NetApp/zufs-zuf zuf]  
>>

Sorry for the late response; I was very sick for a few weeks and am now doing better.

>> I don't remember us coming to the conclusion that this actually is
>> useful and doesn't just badly duplicate the fuse functionality.
> 

Dear Sir Christoph

ZUFS is not at *all* a duplication of the FUSE functionality. In fact they are
almost completely complementary. The systems that would benefit from fuse would
do poorly under zufs, and the systems that benefit from zufs do very *very* poorly
under fuse.
From the get-go I have explained on the mailing list and to the guys that a fuse
replacement would just be a waste of time. Systems that are async in nature, need the
page-cache and are not sensitive to latency are better off with fuse. Systems that need
very low latency, zero copy, sync operations and high parallelism will do very poorly under
fuse, and we need to invent a new multi-dimensional wheel to address those.

ZUFS was never a "better-fuse". It was from the get go a different animal to address
systems and demands that are not possible under fuse.

ZUFS is also (as opposed to fuse) a new way to communicate with user-mode servers, not
necessarily filesystems. It does implement the full filesystem API, but any server, say
MySQL, will benefit under ZUFS from low latency, throughput and parallelism unseen
before. This is because at the core it is a zero-copy synchronous IPC between applications.

It is especially good with pmem. A pmem-only (NVDIMM based) FS running in user mode
gives me *better* results than XFS-DAX in Kernel. Now how is that possible?
(Under a zufs ported pmfs2)
I guess we did not do such a "BAD" job as you were so happily declaring.

The Linux Kernel was always about choice and diversity. There is a very respectable
place for both fuse and zufs side by side tackling different workloads and setups.
In fact, for example, EXT4 and XFS have 95% overlapping functionality. But we both know
that in those places where XFS is king EXT4 can't get close, yet there are still places where
EXT4 does better than XFS, such as single local disk, embedded systems, lighter weight ...
ZUFS and FUSE have maybe at the most 20% overlap in functionality. They are not even
cousins.

So please, why do you make such bold statements, which are not true? Clearly you
have not studied the subject at all. I do not remember you ever participating in one of
my talks, or giving your opinion on the subject, in the 2 years since I first sent
the RFD about the subject (2.5 years).

At the last LSF, Steven from Red Hat asked me to talk with Miklos about fuse vs zufs.
We had a long talk where I explained to him in detail how we do the mounting, how the
Kernel owns the multi-devices, how we do the PMEM API and our IO API in general, how
we do piggy-back operations to minimize latencies, and how we do DAX and mmap. At the end of the
talk he said to me that he understands how this is very different from FUSE and he wished
me "good luck".

Miklos - you have seen both projects; do you think that all these new subsystems from ZUFS
can have a comfortable place under FUSE, including the new IO API?
Believe me I have tried. I am a most lazy person. I would not have slaved on ZUFS for
2 years if all it did was "badly duplicate the fuse functionality". Why would I?

Latest fuse already took some very good ideas from ZUFS. We believe this is a very good
project to have in the Kernel with new innovation.

But Dearest Christoph, I have learned to trust your "guts" about things. Please look
deeper into the subject (perhaps review the code) and try to explain better what your
real concerns are. Perhaps we can address them?

> So is that a hard Nak on inclusion in linux-next at this time?
> 

I do not see what harm it does to anyone if it is included in linux-next.
Would you please help me in testing and stabilizing a very serious and ambitious project
that has merit and is used by clients? I believe it is a very low risk project for the rest
of the Kernel. If not, we can remove it very fast.

Cheers
Boaz

* Re: Please add the zuf tree to linux-next
  2019-11-14 14:02     ` Boaz Harrosh
@ 2019-11-14 14:56       ` Miklos Szeredi
  2019-11-14 16:04         ` Boaz Harrosh
  0 siblings, 1 reply; 8+ messages in thread
From: Miklos Szeredi @ 2019-11-14 14:56 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Stephen Rothwell, Christoph Hellwig, Linus Torvalds,
	Dave Chinner, linux-fsdevel, Alexander Viro, linux-kernel

On Thu, Nov 14, 2019 at 3:02 PM Boaz Harrosh <boaz@plexistor.com> wrote:

> At the last LSF, Steven from Red Hat asked me to talk with Miklos about fuse vs zufs.
> We had a long talk where I explained to him in detail how we do the mounting, how the
> Kernel owns the multi-devices, how we do the PMEM API and our IO API in general, how
> we do piggy-back operations to minimize latencies, and how we do DAX and mmap. At the end of the
> talk he said to me that he understands how this is very different from FUSE and he wished
> me "good luck".
>
> Miklos - you have seen both projects; do you think that all these new subsystems from ZUFS
> can have a comfortable place under FUSE, including the new IO API?

It is quite true that ZUFS includes a lot of innovative ideas to
improve the performance of a certain class of userspace filesystems.
I think most, if not all of those ideas could be applied to the fuse
implementation as well, but I can understand why this hasn't been
done.  Fuse is in serious need of a cleanup, which I've started to do,
but it's not there yet...

One of the major issues that I brought up when originally reviewing
ZUFS (but forgot to discuss at LSF) is about the userspace API.  I
think it would make sense to reuse FUSE protocol definition and extend
it where needed.   That does not mean ZUFS would need to be 100%
backward compatible with FUSE, it would just mean that we'd have a
common userspace API and each implementation could implement a subset
of features.    I think this would be an immediate and significant
boon for ZUFS, since it would give it an already existing user/tester
base that it otherwise needs to build up.  It would also allow
filesystem implementation to be more easily switchable between the
kernel frameworks in case that's necessary.

Thanks,
Miklos

* Re: Please add the zuf tree to linux-next
  2019-11-14 14:56       ` Miklos Szeredi
@ 2019-11-14 16:04         ` Boaz Harrosh
  2019-11-15  8:04           ` Miklos Szeredi
  0 siblings, 1 reply; 8+ messages in thread
From: Boaz Harrosh @ 2019-11-14 16:04 UTC (permalink / raw)
  To: Miklos Szeredi, Boaz Harrosh
  Cc: Stephen Rothwell, Christoph Hellwig, Linus Torvalds,
	Dave Chinner, linux-fsdevel, Alexander Viro, linux-kernel

On 14/11/2019 16:56, Miklos Szeredi wrote:
> On Thu, Nov 14, 2019 at 3:02 PM Boaz Harrosh <boaz@plexistor.com> wrote:
> 
>> At the last LSF, Steven from Red Hat asked me to talk with Miklos about fuse vs zufs.
>> We had a long talk where I explained to him in detail how we do the mounting, how the
>> Kernel owns the multi-devices, how we do the PMEM API and our IO API in general, how
>> we do piggy-back operations to minimize latencies, and how we do DAX and mmap. At the end of the
>> talk he said to me that he understands how this is very different from FUSE and he wished
>> me "good luck".
>>
>> Miklos - you have seen both projects; do you think that all these new subsystems from ZUFS
>> can have a comfortable place under FUSE, including the new IO API?
> 
> It is quite true that ZUFS includes a lot of innovative ideas to
> improve the performance of a certain class of userspace filesystems.
> I think most, if not all of those ideas could be applied to the fuse
> implementation as well, 

This is not so:

- The way we do the mount is very different. It is not the Server that does
  the mount but the Kernel. So auto bind mount works (same device different dir)
- The way zuf owns the devices in the Kernel, and supports multi-devices.
  And has support for pmem devices as well as what we call t2 (regular) block
  devices. And the whole API for transfer between them. (The whole md.* thing).
  Proper locking of devices.
- The way we are true zero-copy both pmem and t2.
- The way we are DAX both pwrite and mmap.
- The way we are NUMA aware both Kernel and Server.
- The way we use shared memory pools that are deep in the protocol between
  Server and Kernel for zero copy of meta-data as well as protocol buffers.
- The way we do piggy-back of operations to save round-trips.
- The way we use cookies in Kernel of all Server objects so there are no
  i_ino hash tables or look-ups.
- The way we use a single Server with loadable FS modules. That the ZUSD comes
  with the distro and only the FS-plugin comes from Vendor. So Kernel=Server API
  is in sync.
- The way ZUFS supports root filesystem.
- The way ZUFS supports VM-FS to SHARE same p-memory as HOST-FS
- The way we do Zero-copy IO, both pmem and bdevs

> but I can understand why this hasn't been
> done.  Fuse is in serious need of a cleanup, which I've started to do,
> but it's not there yet...
> 

This will not be wise. It will be a complete FULL zuf code drop into the
current fuse code base (fuse is BTW bigger than zuf). I think this is the
last thing fuse needs.

I know for a fact that the code of fuse+zuf will be bigger and slower than
the two kept separate.

zufs is built from the ground up, built on all those subsystems as
building blocks. Putting all these things into fuse will actually be like
putting a pyramid on its head.

> One of the major issues that I brought up when originally reviewing
> ZUFS (but forgot to discuss at LSF) is about the userspace API.  I
> think it would make sense to reuse FUSE protocol definition and extend
> it where needed.   That does not mean ZUFS would need to be 100%
> backward compatible with FUSE, it would just mean that we'd have a
> common userspace API and each implementation could implement a subset
> of features.

This is easy to say. But believe me it is not possible. The shared structures
are maybe 20% and not 80% as the theory might feel about it. The projects are
really structured differently.

I have looked at it long and hard, many times. I do not know how to do this.
If I knew how I would.

These codes and systems do very different things. It will need tons of
if()s and operation changes. Sometimes you do a copy/paste of ext4 into
ffs2 and so on. Because the combination is not always the best and the
easiest.

> I think this would be an immediate and significant
> boon for ZUFS, since it would give it an already existing user/tester
> base that it otherwise needs to build up.  It would also allow
> filesystem implementation to be more easily switchable between the
> kernel frameworks in case that's necessary.
> 

Thanks Miklos for your input. I have looked at this problem many times.
This is not something that is interesting for me, because these two projects
come to solve different things.

And it is not so easy to do as it sounds. There are fundamental differences
between the projects. For example, in fuse main() belongs to the FS, which needs
to supply its own mount application. In ZUFS we do the regular Kernel's /sbin/mount.
Also ZUS User-mode server has a huge facility for allocating pages, mlocking,
per-cpu counters per-cpu variables, NUMA memory management. Thread management.
The API with zuf is very very particular about tons of things. Involving threads
and special files and mmap calls, and shared memory with Kernel. This will not be so
easily interchangeable.

> Thanks,
> Miklos
> 

Sometimes fresh new code is much easier, more maintainable and faster / more capable
than a do-it-all blob of code.
I am not sure if you actually looked at the code, both Kernel and Server. This is not so easy
as it sounds. Even after a deep fuse cleanup.

Yes perhaps we could share some core code, like what sits in zuf-core.c and the relay object
but not more than that.

Thanks
Boaz

* Re: Please add the zuf tree to linux-next
  2019-11-14 16:04         ` Boaz Harrosh
@ 2019-11-15  8:04           ` Miklos Szeredi
  2019-11-18 15:44             ` Boaz Harrosh
  0 siblings, 1 reply; 8+ messages in thread
From: Miklos Szeredi @ 2019-11-15  8:04 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Stephen Rothwell, Christoph Hellwig, Linus Torvalds,
	Dave Chinner, linux-fsdevel, Alexander Viro, linux-kernel

On Thu, Nov 14, 2019 at 5:04 PM Boaz Harrosh <boaz@plexistor.com> wrote:
>
> On 14/11/2019 16:56, Miklos Szeredi wrote:
> > On Thu, Nov 14, 2019 at 3:02 PM Boaz Harrosh <boaz@plexistor.com> wrote:
> >
> >> At the last LSF, Steven from Red Hat asked me to talk with Miklos about fuse vs zufs.
> >> We had a long talk where I explained to him in detail how we do the mounting, how the
> >> Kernel owns the multi-devices, how we do the PMEM API and our IO API in general, how
> >> we do piggy-back operations to minimize latencies, and how we do DAX and mmap. At the end of the
> >> talk he said to me that he understands how this is very different from FUSE and he wished
> >> me "good luck".
> >>
> >> Miklos - you have seen both projects; do you think that all these new subsystems from ZUFS
> >> can have a comfortable place under FUSE, including the new IO API?
> >
> > It is quite true that ZUFS includes a lot of innovative ideas to
> > improve the performance of a certain class of userspace filesystems.
> > I think most, if not all of those ideas could be applied to the fuse
> > implementation as well,
>
> This is not so:
>
> - The way we do the mount is very different. It is not the Server that does
>   The mount but the Kernel. So auto bind mount works (same device different dir)

This is not a significant difference.  I.e. the following could be
added to the fuse protocol to optionally operate this way:

- server registers filesystem at startup, does not perform any mount
(sends FUSE_NOTIFY_REGISTER)
- on mount kernel sends a FUSE_FS_LOOKUP message, server looks up or
creates filesystem instance and returns a filesystem ID
- filesystem ID is sent in further message headers (there's a 32bit
spare field where this fits nicely)
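
The three steps above can be modelled in a few lines of user-space code.
This is only a sketch of the proposed flow: FUSE_NOTIFY_REGISTER and
FUSE_FS_LOOKUP are the message names suggested in this mail, not existing
FUSE opcodes, and the Server/Kernel classes, the "examplefs" type and the
fs_id header field are hypothetical names used for illustration.

```python
class Server:
    """Toy userspace server (hypothetical, not libfuse)."""

    def __init__(self, fs_type):
        self.fs_type = fs_type
        self.instances = {}  # device -> filesystem ID
        self.next_id = 1

    def fs_lookup(self, device):
        # Reply to the proposed FUSE_FS_LOOKUP: look up or create the
        # filesystem instance and return its filesystem ID.
        if device not in self.instances:
            self.instances[device] = self.next_id
            self.next_id += 1
        return self.instances[device]


class Kernel:
    """Toy kernel side: keeps registrations and performs the mount."""

    def __init__(self):
        self.registered = {}  # fs_type -> Server

    def notify_register(self, server):
        # Step 1 (proposed FUSE_NOTIFY_REGISTER): register, do not mount.
        self.registered[server.fs_type] = server

    def mount(self, fs_type, device):
        # Step 2: on mount the kernel asks the server for a filesystem ID.
        return self.registered[fs_type].fs_lookup(device)

    def request(self, fs_id, opcode):
        # Step 3: every later message header carries the filesystem ID.
        return {"fs_id": fs_id, "opcode": opcode}


server = Server("examplefs")
kernel = Kernel()
kernel.notify_register(server)
first = kernel.mount("examplefs", "/dev/pmem0")
again = kernel.mount("examplefs", "/dev/pmem0")  # same instance, same ID
other = kernel.mount("examplefs", "/dev/pmem1")  # new instance, new ID
```

Mounting the same device twice yields the same filesystem ID, which is
what would let the kernel treat the second mount as a bind of the first.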

> - The way zuf owns the devices in the Kernel, and supports multi-devices.

Same as above, one server process could handle as many filesystem
instances (possibly of different type) as necessary.

>   And has support for pmem devices as well as what we call t2 (regular) block
>   devices. And the whole API for transfer between them. (The whole md.* thing).

Extending the protocol to pass reference to pmem or any other device
is certainly possible.  See the  FUSE2_DEV_IOC_MAP_OPEN in the
prototype.

>   Proper locking of devices.

Care to explain?

> - The way we are true zero-copy both pmem and t2.

See FUSE_MAP request in fuse2 prototype.

> - The way we are DAX both pwrite and mmap.

This is not implemented yet in the prototype, but there's nothing
preventing the mapping returned by the FUSE_MAP request to be cached
and used for mmap and  I/O without any further exchanges with server.

> - The way we are NUMA aware both Kernel and Server.

I've tested the prototype on huge NUMA systems, and it certainly was
very scalable.

> - The way we use shared memory pools that are deep in the protocol between
>   Server and Kernel for zero copy of meta-data as well as protocol buffers.

Again, the fuse2 prototype uses shared memory for communication, and
this helps (though not as much as CPU locality).

> - The way we do piggy-back of operations to save round-trips.

It is not difficult to extend the FUSE protocol to allow bundling of
several requests and replies.
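
As a rough illustration of what bundling buys, the sketch below counts
round trips for the same three requests sent one by one and sent as a
single bundle. The Channel class, the handler and the opcode strings are
hypothetical; this models neither the FUSE wire format nor the zuf one.

```python
class Channel:
    """Toy kernel<->server channel that counts round trips."""

    def __init__(self, handler):
        self.handler = handler
        self.round_trips = 0

    def send(self, requests):
        # One call is one round trip, however many requests it carries.
        self.round_trips += 1
        return [self.handler(request) for request in requests]


def handler(request):
    # Hypothetical server: acknowledge every request.
    opcode, name = request
    return (opcode, "ok", name)


requests = [("LOOKUP", "a"), ("OPEN", "a"), ("READ", "a")]

one_by_one = Channel(handler)
separate = [one_by_one.send([request])[0] for request in requests]

bundled_channel = Channel(handler)
bundled = bundled_channel.send(requests)
```

The replies are identical in both cases; only the number of round trips
differs (three versus one).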

> - The way we use cookies in Kernel of all Server objects so there are no
>   i_ino hash tables or look-ups.

I don't get that.  zuf_iget() calls iget_locked() which does the inode
hash lookup.

> - The way we use a single Server with loadable FS modules. That the ZUSD comes
>   with the distro and only the FS-plugin comes from Vendor. So Kernel=Server API
>   is in sync.

Same abstraction is provided by libfuse.  Pluggable fs modules are
also certainly possible, in fact libfuse already has something like
that: fuse_register_module().

> - The way ZUFS supports root filesystem.

Why is that a unique feature?

> - The way ZUFS supports VM-FS to SHARE same p-memory as HOST-FS
> - The way we do Zero-copy IO, both pmem and bdevs

I think these have been mentioned above already.

> > One of the major issues that I brought up when originally reviewing
> > ZUFS (but forgot to discuss at LSF) is about the userspace API.  I
> > think it would make sense to reuse FUSE protocol definition and extend
> > it where needed.   That does not mean ZUFS would need to be 100%
> > backward compatible with FUSE, it would just mean that we'd have a
> > common userspace API and each implementation could implement a subset
> > of features.
>
> This is easy to say. But believe me it is not possible. The shared structures
> are maybe 20% and not 80% as the theory might feel about it. The projects are
> really structured differently.

Well, I'm not saying it would be an easy job, just that doing a
rewrite with the already existing and well established API might well
pay off in the long run.

> I have looked at it long and hard, many times. I do not know how to do this.
> If I knew how I would.
>
> These codes and systems do very different things. It will need tons of
> if()s and operation changes. Sometimes you do a copy/paste of ext4 into
> ffs2 and so on. Because the combination is not always the best and the
> easiest.

Again, I'm not suggesting that you add zufs features to fuse.   I'm
suggesting that you implement zufs features with the fuse protocol,
extending it where needed, but keeping the basic format the same.

>
> > I think this would be an immediate and significant
> > boon for ZUFS, since it would give it an already existing user/tester
> > base that it otherwise needs to build up.  It would also allow
> > filesystem implementation to be more easily switchable between the
> > kernel frameworks in case that's necessary.
> >
>
> Thanks Miklos for your input. I have looked at this problem many times.
> This is not something that is interesting for me, because these two projects
> come to solve different things.
>
> And it is not so easy to do as it sounds. There are fundamental differences
> between the projects. For example, in fuse main() belongs to the FS, which needs
> to supply its own mount application. In ZUFS we do the regular Kernel's /sbin/mount.
> Also ZUS User-mode server has a huge facility for allocating pages, mlocking,
> per-cpu counters per-cpu variables, NUMA memory management. Thread management.
> The API with zuf is very very particular about tons of things. Involving threads
> and special files and mmap calls, and shared memory with Kernel. This will not be so
> easily interchangeable.

I hope to get around to do a review eventually.  API design is hard.
I know how many times I got it wrong in fuse, and how much pain that
has caused.

Thanks,
Miklos

* Re: Please add the zuf tree to linux-next
  2019-11-15  8:04           ` Miklos Szeredi
@ 2019-11-18 15:44             ` Boaz Harrosh
  0 siblings, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2019-11-18 15:44 UTC (permalink / raw)
  To: Miklos Szeredi, Boaz Harrosh
  Cc: Stephen Rothwell, Christoph Hellwig, Linus Torvalds,
	Dave Chinner, linux-fsdevel, Alexander Viro, linux-kernel

On 15/11/2019 10:04, Miklos Szeredi wrote:
> On Thu, Nov 14, 2019 at 5:04 PM Boaz Harrosh <boaz@plexistor.com> wrote:
<>
>> - The way we do the mount is very different. It is not the Server that does
>>   The mount but the Kernel. So auto bind mount works (same device different dir)
> 
> This is not a significant difference.  I.e. the following could be
> added to the fuse protocol to optionally operate this way:
> 
> - server registers filesystem at startup, does not perform any mount
> (sends FUSE_NOTIFY_REGISTER)
> - on mount kernel sends a FUSE_FS_LOOKUP message, server looks up or
> creates filesystem instance and returns a filesystem ID
> - filesystem ID is sent in further message headers (there's a 32bit
> spare field where this fits nicely)
> 

OK

>> - The way zuf owns the devices in the Kernel, and supports multi-devices.
> 
> Same as above, one server process could handle as many filesystem
> instances (possibly of different type) as necessary.
> 

[md]
You misunderstood me. In zuf, similar to btrfs, we support multiple devices
under the same super-block via a device_table. Any device from the list
given on the command line will mount the whole device_table in the correct
locking order, including auto-bind mount. Any device given on the command line
will find and load the same SB.

Once the device_table is loaded, the whole t1 (pmem) space is presented as a single
linear address space to the Server, and the whole t2 (non-pmem) device-space
is presented as one abstract linear array.
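
The "single linear address space" idea can be sketched as a lookup from
a linear offset to a (device, offset-within-device) pair. This is only a
toy model: the DeviceTable class, its resolve() method and the device
sizes are hypothetical and are not taken from the md.* code.

```python
from bisect import bisect_right


class DeviceTable:
    """Toy model: several devices presented as one linear space."""

    def __init__(self, devices):
        # devices: list of (name, size_in_bytes) in device_table order
        self.names = [name for name, _ in devices]
        self.starts = []  # linear offset at which each device begins
        offset = 0
        for _, size in devices:
            self.starts.append(offset)
            offset += size
        self.total = offset

    def resolve(self, linear_offset):
        """Map a linear offset to (device name, offset inside device)."""
        if not 0 <= linear_offset < self.total:
            raise ValueError("offset outside the linear space")
        # Last device whose start is <= linear_offset.
        index = bisect_right(self.starts, linear_offset) - 1
        return self.names[index], linear_offset - self.starts[index]


t1 = DeviceTable([("/dev/pmem0", 4096), ("/dev/pmem1", 8192)])
```

With these made-up sizes, offset 4095 is the last byte of /dev/pmem0 and
offset 4096 is the first byte of /dev/pmem1.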

>>   And has support for pmem devices as well as what we call t2 (regular) block
>>   devices. And the whole API for transfer between them. (The whole md.* thing).
> 
> Extending the protocol to pass reference to pmem or any other device
> is certainly possible.  See the  FUSE2_DEV_IOC_MAP_OPEN in the
> prototype.
> 

This is new, not yet tested code that I believe was inspired by zufs?
Our ZUFS_IOC_IO is much much richer (just because it is older) than
fuse's.

Our code is very stable and heavily tested. And runs at customers' sites.

Just one more reason why ZUFS should be in Kernel. Linux's forte is
its diversity, and the way projects interchange ideas and code.
FUSE already gained so much from ZUFS. Why would we not have it in Kernel?

>>   Proper locking of devices.
> 
> Care to explain?
> 

See the [md] explanation above. Think of a race between:

mount /dev/pmem0 /foo
mount /dev/pmem1 /bar

But pmem0 && pmem1 belong to the same FS (under the same SB). Can user-mode
resolve such a race? Never. Only the Kernel, as the one central point, can.
Again see md.* files in the zuf project. This is important code.
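
To make the race concrete, here is a toy model of "one central point":
two concurrent mounts of different devices that belong to the same
device_table must end up on one shared super-block. The
SuperBlockRegistry class and its dict-based super-block are hypothetical
and only illustrate the serialization, not how md.* does it.

```python
import threading


class SuperBlockRegistry:
    """Toy central registry: one super-block per device_table."""

    def __init__(self, device_tables):
        # device name -> tuple of all devices in the same filesystem
        self.device_tables = device_tables
        self.lock = threading.Lock()
        self.mounted = {}  # device tuple -> super-block

    def mount(self, device):
        table = self.device_tables[device]
        # The lock is the central point: find-or-create is atomic, so
        # two racing mounts can never create two super-blocks.
        with self.lock:
            sb = self.mounted.get(table)
            if sb is None:
                sb = {"devices": table, "mounts": 0}
                self.mounted[table] = sb
            sb["mounts"] += 1
            return sb


table = ("/dev/pmem0", "/dev/pmem1")
registry = SuperBlockRegistry({device: table for device in table})

results = []
threads = [
    threading.Thread(target=lambda d=device: results.append(registry.mount(d)))
    for device in table
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Whichever mount wins the race creates the super-block and the other one
finds it, so both mount points refer to the same object.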

>> - The way we are true zero-copy both pmem and t2.
> 
> See FUSE_MAP request in fuse2 prototype.
> 

Again very new code. Ours is richer and older and very much stabilized.
And has some unique features that are only possible under zuf and the way it
is structured.

>> - The way we are DAX both pwrite and mmap.
> 
> This is not implemented yet in the prototype, but there's nothing
> preventing the mapping returned by the FUSE_MAP request to be cached
> and used for mmap and  I/O without any further exchanges with server.
> 

Again FUSE_MAP is newer code than ZUFS. And is still lacking features
in order to work for zufs and dax.

>> - The way we are NUMA aware both Kernel and Server.
> 
> I've tested the prototype on huge NUMA systems, and it certainly was
> very scalable.
> 

I am not sure you have ever implemented multi-NUMA pmem and multi-NUMA
RDMA NICs and NVMe cards. These are not supported by FUSE and very
hard to implement by other Kernel APIs.

The md.h code is NUMA aware from the ground up and presents the server with
the full information it needs.

No other Filesystem in the world does that.

>> - The way we use shared memory pools that are deep in the protocol between
>>   Server and Kernel for zero copy of meta-data as well as protocol buffers.
> 
> Again, the fuse2 prototype uses shared memory for communication, and
> this helps (though not as much as CPU locality).
> 

Yes inspired by zufs? You said yourself "fuse2 prototype". Our code
is two years old and is way past prototype. It is even past alpha and beta
and runs at customers' data centers.

For the "fuse2 prototype" to support the special needs of ZUFS it will
need more changes still.

>> - The way we do piggy-back of operations to save round-trips.
> 
> It is not difficult to extend the FUSE protocol to allow bundling of
> several requests and replies.
> 

Again this is already done.

>> - The way we use cookies in Kernel of all Server objects so there are no
>>   i_ino hash tables or look-ups.
> 
> I don't get that.  zuf_iget() calls iget_locked() which does the inode
> hash lookup.
> 

Sorry I did not explain well. I mean in fuse the communication passes an i_ino
to denote what file to write to. Therefore userspace needs a hash-table to
look up i_ino-to-FS-object at every API call?

In zufs we have an opaque struct zus_inode associated per kernel-inode so
the only hash is the Kernel hash. The same is with all other Server objects like
per-sb, per FS-register, xattrs and so on.
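
The difference described here can be sketched as follows. In the first
server every call carries a number that must be looked up in a table; in
the second the kernel-side inode keeps an opaque handle that already is
the server object. All class names are hypothetical, the Python object
reference merely stands in for the opaque struct zus_inode cookie, and
the "ino-keyed" variant reflects the description in this mail rather
than a statement about how any particular fuse server is written.

```python
class InoKeyedServer:
    """Server addressed by inode number: needs its own lookup table."""

    def __init__(self):
        self.objects = {}       # i_ino -> server-side object
        self.table_lookups = 0

    def create(self, ino):
        self.objects[ino] = {"ino": ino, "size": 0}
        return ino              # the kernel only remembers the number

    def write(self, ino, nbytes):
        self.table_lookups += 1  # one table lookup per call
        obj = self.objects[ino]
        obj["size"] += nbytes
        return obj["size"]


class CookieServer:
    """Server addressed by an opaque cookie: no table on the server."""

    def __init__(self):
        self.table_lookups = 0   # never incremented

    def create(self, ino):
        return {"ino": ino, "size": 0}  # the object itself is the cookie

    def write(self, cookie, nbytes):
        cookie["size"] += nbytes
        return cookie["size"]


class KernelInode:
    """Toy kernel inode that stores whatever handle the server gave it."""

    def __init__(self, ino, server):
        self.server = server
        self.handle = server.create(ino)

    def write(self, nbytes):
        return self.server.write(self.handle, nbytes)


by_ino = KernelInode(42, InoKeyedServer())
by_cookie = KernelInode(42, CookieServer())
for inode in (by_ino, by_cookie):
    inode.write(100)
    inode.write(28)
```

Both variants end with the same file size; the cookie variant gets there
without any server-side table lookup.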

>> - The way we use a single Server with loadable FS modules. That the ZUSD comes
>>   with the distro and only the FS-plugin comes from Vendor. So Kernel=Server API
>>   is in sync.
> 
> Same abstraction is provided by libfuse.  Pluggable fs modules are
> also certainly possible, in fact libfuse already has something like
> that: fuse_register_module().
> 
 ---
>> - The way ZUFS supports root filesystem.
> 
> Why is that a unique feature?
> 

Can fuse be the root FS? I did not know. Can you install and boot a Fedora on it?

>> - The way ZUFS supports VM-FS to SHARE same p-memory as HOST-FS
>> - The way we do Zero-copy IO, both pmem and bdevs
> 
> I think these have been mentioned above already.
> 
 ---
<>
> Well, I'm not saying it would be an easy job, just that doing a
> rewrite with the already existing and well established API might well
> pay off in the long run.
> 

I think the opposite. I think the projects separate would be more stable
and less risky and less work. They do come to solve two opposite sides
of the problem spectrum. (See page-cache vs pmem)

Bloating everything in one place is sometimes risky to the two sides.

<>
> 
> Again, I'm not suggesting that you add zufs features to fuse.   I'm
> suggesting that you implement zufs features with the fuse protocol,
> extending it where needed, but keeping the basic format the same.
> 

Sigh, FUSE has legacy I do not want. And the new stuff that I need
is in prototype stage and very big parts are still missing.
I still do not see the merit of keeping them the same. The FS will need to
know.

I am not sure you are fully aware of the ZUFS API and what it enables.
An FS that supports both pmem and bdev devices under the same SB and
behind the scenes migrates data from hot-to-cold or cold-to-hot storage
is hard to do. The locking and racing takes a long time to master. The
DAX thing that ZUFS is doing is not so simple either.

I am the laziest person there is. Believe me. What you are suggesting is
much much more work, short term and long. And I do not see any other benefits.
Having all this extra bloat in fuse is not good for fuse users. And ....
Fuse will never be what zufs wants to be, because of legacy and structure.

I do see a lot of merit to have both projects in Kernel and both
projects feed and inspire each other. Just as they already are.

<>
> 
> I hope to get around to do a review eventually.  API design is hard.
> I know how many times I got it wrong in fuse, and how much pain that
> has caused.
> 

True

> Thanks,
> Miklos
> 

Thanks Miklos. I will think some more about what you are saying.
Boaz
