* [LSF/MM TOPIC] A high-performance userspace block driver
@ 2018-01-16 14:52 ` Matthew Wilcox
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2018-01-16 14:52 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, linux-fsdevel, linux-block


I see the improvements that Facebook have been making to the nbd driver,
and I think that's a wonderful thing.  Maybe the outcome of this topic
is simply: "Shut up, Matthew, this is good enough".

It's clear that there's an appetite for userspace block devices; not for
swap devices or the root device, but for accessing data that's stored
in that silo over there, and I really don't want to bring that entire
mess of CORBA / Go / Rust / whatever into the kernel to get to it,
but it would be really handy to present it as a block device.

I've looked at a few block-driver-in-userspace projects that exist, and
they all seem pretty bad.  For example, one API maps a few gigabytes of
address space and plays games with vm_insert_page() to put page cache
pages into the address space of the client process.  Of course, the TLB
flush overhead of that solution is criminal.
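
For illustration, the pattern in question is roughly the following (a
sketch only, not any particular project's code; error handling omitted):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch of the approach being criticized: insert a page-cache page
 * into the client's mapping.  Every page inserted this way creates a
 * PTE that must later be zapped, and zapping means TLB shootdowns.
 */
static int expose_cache_page(struct vm_area_struct *vma, unsigned long uaddr,
                             struct address_space *mapping, pgoff_t index)
{
        struct page *page = find_get_page(mapping, index);
        int ret;

        if (!page)
                return -ENOENT;
        ret = vm_insert_page(vma, uaddr, page);
        put_page(page);         /* vm_insert_page() took its own reference */
        return ret;
}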

I've looked at pipes, and they're not an awful solution.  We've almost
got enough syscalls to treat other objects as pipes.  The problem is
that they're not seekable.  So essentially you're looking at having one
pipe per outstanding command.  If you want to make good use of a modern
NAND device, you want a few hundred outstanding commands, and that's a
bit of a shoddy interface.

Right now, I'm leaning towards combining these two approaches; adding
a VM_NOTLB flag so the mmaped bits of the page cache never make it into
the process's address space, so the TLB shootdown can be safely skipped.
Then check it in follow_page_mask() and return the appropriate struct
page.  As long as the userspace process does everything using O_DIRECT,
I think this will work.
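
A very rough sketch of the helper that follow_page_mask() could call for
such a VMA (VM_NOTLB and follow_notlb_page() are hypothetical and don't
exist today; locking and refcounting details glossed over):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical: resolve a GUP lookup on a VM_NOTLB area straight from
 * the page cache of the backing file.  No PTEs are ever installed for
 * such a mapping, so there is nothing to shoot down on invalidation.
 */
static struct page *follow_notlb_page(struct vm_area_struct *vma,
                                      unsigned long address)
{
        pgoff_t index = linear_page_index(vma, address);

        if (!(vma->vm_flags & VM_NOTLB) || !vma->vm_file)
                return NULL;

        /* returns the page with a reference held, as GUP expects */
        return find_get_page(vma->vm_file->f_mapping, index);
}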

It's either that or make pipes seekable ...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 ` Matthew Wilcox
@ 2018-01-16 23:04   ` Viacheslav Dubeyko
  -1 siblings, 0 replies; 23+ messages in thread
From: Viacheslav Dubeyko @ 2018-01-16 23:04 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, linux-fsdevel, linux-block

On Tue, 2018-01-16 at 06:52 -0800, Matthew Wilcox wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
> 
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
> 
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
> 
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
> 
> It's either that or make pipes seekable ...

I like the whole idea. But why pipes? What about shared memory? Making
pipes seekable sounds like it kills the original concept: we usually
treat a pipe as a FIFO communication channel, so a seekable pipe sounds
really strange to me. Maybe we need some new abstraction?

By the way, what use case(s) do you have in mind for the suggested
approach?

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 ` Matthew Wilcox
@ 2018-01-16 23:23   ` Theodore Ts'o
  -1 siblings, 0 replies; 23+ messages in thread
From: Theodore Ts'o @ 2018-01-16 23:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, lsf-pc, linux-block

On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> 
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.

... and using iSCSI was too painful and heavyweight.

Google has an iblock device implementation, so you can use that as
confirmation that there certainly has been a desire for such a thing.
In fact, we're happily using it in production even as we speak.

We have been (tentatively) planning on presenting it at OSS North
America later in the year, since the Vault conference is no longer
with us, but we could probably put together a quick presentation for
LSF/MM if there is interest.

There were plans to do something using page cache tricks (what we were
calling the "zero copy" option), but we decided to start with something
simpler and more reliable; as long as it was less overhead and pain than
iSCSI (which was simply an over-engineered solution for our use case),
it was all upside.

						- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 23:23   ` Theodore Ts'o
@ 2018-01-16 23:28     ` James Bottomley
  -1 siblings, 0 replies; 23+ messages in thread
From: James Bottomley @ 2018-01-16 23:28 UTC (permalink / raw)
  To: Theodore Ts'o, Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, lsf-pc, linux-block, linux-scsi

On Tue, 2018-01-16 at 18:23 -0500, Theodore Ts'o wrote:
> On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> > 
> > 
> > I see the improvements that Facebook have been making to the nbd
> > driver, and I think that's a wonderful thing.  Maybe the outcome of
> > this topic is simply: "Shut up, Matthew, this is good enough".
> > 
> > It's clear that there's an appetite for userspace block devices;
> > not for swap devices or the root device, but for accessing data
> > that's stored in that silo over there, and I really don't want to
> > bring that entire mess of CORBA / Go / Rust / whatever into the
> > kernel to get to it, but it would be really handy to present it as
> > a block device.
> 
> ... and using iSCSI was too painful and heavyweight.

From what I've seen, a reasonable number of storage-over-IP cloud
implementations are actually using AoE.  The argument goes that the
protocol is about ideal (at least as compared to iSCSI or FCoE) and the
company behind it doesn't seem to want to add any more features that
would bloat it.

James

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 23:28     ` James Bottomley
@ 2018-01-16 23:57       ` Bart Van Assche
  -1 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-01-16 23:57 UTC (permalink / raw)
  To: James.Bottomley, tytso, willy
  Cc: linux-block, linux-mm, lsf-pc, linux-fsdevel, linux-scsi

On Tue, 2018-01-16 at 15:28 -0800, James Bottomley wrote:
> On Tue, 2018-01-16 at 18:23 -0500, Theodore Ts'o wrote:
> > On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> > > 
> > > 
> > > I see the improvements that Facebook have been making to the nbd
> > > driver, and I think that's a wonderful thing.  Maybe the outcome of
> > > this topic is simply: "Shut up, Matthew, this is good enough".
> > > 
> > > It's clear that there's an appetite for userspace block devices;
> > > not for swap devices or the root device, but for accessing data
> > > that's stored in that silo over there, and I really don't want to
> > > bring that entire mess of CORBA / Go / Rust / whatever into the
> > > kernel to get to it, but it would be really handy to present it as
> > > a block device.
> > 
> > ... and using iSCSI was too painful and heavyweight.
> 
> From what I've seen a reasonable number of storage over IP cloud
> implementations are actually using AoE.  The argument goes that the
> protocol is about ideal (at least as compared to iSCSI or FCoE) and the
> company behind it doesn't seem to want to add any more features that
> would bloat it.

Has anyone already looked into iSER, SRP or NVMeOF over rdma_rxe over the
loopback network driver? I think all three driver stacks support zero-copy
receiving, something that is not possible with iSCSI/TCP nor with AoE.

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LSF/MM TOPIC] A high-performance userspace block driver
@ 2018-01-17  0:41   ` Bart Van Assche
  0 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-01-17  0:41 UTC (permalink / raw)
  To: lsf-pc, willy; +Cc: linux-mm, linux-block, linux-fsdevel

On Tue, 2018-01-16 at 06:52 -0800, Matthew Wilcox wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
> 
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
> 
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
> 
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
> 
> It's either that or make pipes seekable ...

How about using the RDMA API and the rdma_rxe driver over loopback? The RDMA
API supports zero-copy communication, which is something the BSD socket API
does not. The RDMA API also supports byte-level granularity, and the
hot path (ib_post_send(), ib_post_recv(), ib_poll_cq()) does not require any
system calls for PCIe RDMA adapters. The rdma_rxe driver, however, uses a
system call to trigger the send doorbell.
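
For reference, the userspace side of that hot path is plain library calls,
roughly like this (libibverbs; queue pair setup and error handling omitted):

#include <infiniband/verbs.h>

/*
 * Drain completions from a completion queue.  ibv_poll_cq() just reads
 * the CQ ring in user-mapped memory, so no system call is needed on the
 * hot path for hardware RDMA adapters.
 */
static int drain_completions(struct ibv_cq *cq)
{
        struct ibv_wc wc[16];
        int n, i;

        while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
                for (i = 0; i < n; i++) {
                        if (wc[i].status != IBV_WC_SUCCESS)
                                return -1;
                        /* wc[i].wr_id identifies the completed request */
                }
        }
        return n;       /* 0 when the CQ is empty, negative on error */
}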

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 ` Matthew Wilcox
@ 2018-01-17  2:49   ` Ming Lei
  -1 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-17  2:49 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, Linux FS Devel, linux-block

On Tue, Jan 16, 2018 at 10:52 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
>
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.

I like the idea; one line of Python (or whatever) code may take thousands
of lines of C to do the same job in the kernel.

>
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
>
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
>
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
>
> It's either that or make pipes seekable ...

Userfaultfd might be another choice:

1) map the block LBA space into a range of the process's VM space

2) when a READ/WRITE request comes in, convert it to a page fault on
the mapped range and let userland take control of it; meanwhile the
kernel request context sleeps

3) the IO request context on the kernel side is woken up once userspace
has completed the IO request via userfaultfd

4) the kernel side then finishes the IO, e.g. copying pages from the
storage range to the request (bio) pages.

READ should be fine since it is very similar to the use case of QEMU
postcopy live migration; WRITE can be a bit different and may need some
changes to userfaultfd.
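
A rough userspace sketch of steps 2) and 3) with the existing userfaultfd
ABI (UFFDIO_API/UFFDIO_REGISTER setup omitted; fetch_block() is a
hypothetical helper that reads the backing store):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* hypothetical helper: read the block backing 'addr' into 'buf' */
extern void fetch_block(unsigned long addr, void *buf);

static void serve_one_fault(int uffd, void *scratch, long page_size)
{
        struct uffd_msg msg;
        struct uffdio_copy copy;

        /* blocks until the kernel reports a fault on the registered range */
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                return;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
                return;

        fetch_block(msg.arg.pagefault.address, scratch);

        copy.dst  = msg.arg.pagefault.address & ~(page_size - 1);
        copy.src  = (unsigned long)scratch;
        copy.len  = page_size;
        copy.mode = 0;
        /* installs the page and wakes up the faulting (kernel) context */
        ioctl(uffd, UFFDIO_COPY, &copy);
}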

-- 
Ming Lei

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-17  2:49   ` Ming Lei
@ 2018-01-17 21:21     ` Matthew Wilcox
  -1 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2018-01-17 21:21 UTC (permalink / raw)
  To: Ming Lei; +Cc: lsf-pc, linux-mm, Linux FS Devel, linux-block

On Wed, Jan 17, 2018 at 10:49:24AM +0800, Ming Lei wrote:
> Userfaultfd might be another choice:
> 
> 1) map the block LBA space into a range of process vm space

That would limit the size of a block device to ~200TB (with my laptop's
CPU).  That's probably OK for most users, but I suspect there are some
who would chafe at such a restriction (before the 57-bit CPUs arrive).

> 2) when READ/WRITE req comes, convert it to page fault on the
> mapped range, and let userland to take control of it, and meantime
> kernel req context is slept

You don't want to sleep the request; you want it to be able to submit
more I/O.  But we have infrastructure in place to inform the submitter
when I/Os have completed.

> 3) IO req context in kernel side is waken up after userspace completed
> the IO request via userfaultfd
> 
> 4) kernel side continue to complete the IO, such as copying page from
> storage range to req(bio) pages.
> 
> Seems READ should be fine since it is very similar with the use case
> of QEMU postcopy live migration, WRITE can be a bit different, and
> maybe need some change on userfaultfd.

I like this idea, and maybe extending UFFD is the way to solve this
problem.  Perhaps I should explain a little more what the requirements
are.  At the point the driver gets the I/O, pages to copy data into (for
a read) or copy data from (for a write) have already been allocated.
At all costs, we need to avoid playing VM tricks (because TLB flushes
are expensive).  So one copy is probably OK, but we'd like to avoid it
if reasonable.

Let's assume that the userspace program looks at the request metadata and
decides that it needs to send a network request.  Ideally, it would find
a way to have the data from the response land in the pre-allocated pages
(for a read) or send the data straight from the pages in the request
(for a write).  I'm not sure UFFD helps us with that part of the problem.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LSF/MM TOPIC] A high-performance userspace block driver
@ 2018-01-18  5:27   ` Figo.zhang
  0 siblings, 0 replies; 23+ messages in thread
From: Figo.zhang @ 2018-01-18  5:27 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, Linux MM, linux-fsdevel, linux-block

2018-01-16 22:52 GMT+08:00 Matthew Wilcox <willy@infradead.org>:

>
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
>
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
>
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.


How about SPDK?


> For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
>
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
>
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
>
> It's either that or make pipes seekable ...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-17 21:21     ` Matthew Wilcox
@ 2018-01-22 12:02       ` Mike Rapoport
  -1 siblings, 0 replies; 23+ messages in thread
From: Mike Rapoport @ 2018-01-22 12:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Ming Lei, lsf-pc, linux-mm, Linux FS Devel, linux-block

On Wed, Jan 17, 2018 at 01:21:44PM -0800, Matthew Wilcox wrote:
> On Wed, Jan 17, 2018 at 10:49:24AM +0800, Ming Lei wrote:
> > Userfaultfd might be another choice:
> > 
> > 1) map the block LBA space into a range of process vm space
> 
> That would limit the size of a block device to ~200TB (with my laptop's
> CPU).  That's probably OK for most users, but I suspect there are some
> who would chafe at such a restriction (before the 57-bit CPUs arrive).
> 
> > 2) when READ/WRITE req comes, convert it to page fault on the
> > mapped range, and let userland to take control of it, and meantime
> > kernel req context is slept
> 
> You don't want to sleep the request; you want it to be able to submit
> more I/O.  But we have infrastructure in place to inform the submitter
> when I/Os have completed.

It's possible to queue IO requests and have a kthread that will convert
those requests to page faults. The thread indeed will sleep on each page
fault, though.
 
> > 3) IO req context in kernel side is waken up after userspace completed
> > the IO request via userfaultfd
> > 
> > 4) kernel side continue to complete the IO, such as copying page from
> > storage range to req(bio) pages.
> > 
> > Seems READ should be fine since it is very similar with the use case
> > of QEMU postcopy live migration, WRITE can be a bit different, and
> > maybe need some change on userfaultfd.
> 
> I like this idea, and maybe extending UFFD is the way to solve this
> problem.  Perhaps I should explain a little more what the requirements
> are.  At the point the driver gets the I/O, pages to copy data into (for
> a read) or copy data from (for a write) have already been allocated.
> At all costs, we need to avoid playing VM tricks (because TLB flushes
> are expensive).  So one copy is probably OK, but we'd like to avoid it
> if reasonable.
> 
> Let's assume that the userspace program looks at the request metadata and
> decides that it needs to send a network request.  Ideally, it would find
> a way to have the data from the response land in the pre-allocated pages
> (for a read) or send the data straight from the pages in the request
> (for a write).  I'm not sure UFFD helps us with that part of the problem.

As of now it does not. UFFD allocates pages when userland asks to copy
data into a UFFD-controlled VMA. In your example, after the data arrives
from the network, userland can copy it into a page that UFFD will allocate.

Unrelated to block devices, I've been thinking of implementing splice for
userfaultfd...

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-17 21:21     ` Matthew Wilcox
@ 2018-01-22 12:18       ` Ming Lei
  -1 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22 12:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Linux FS Devel, linux-mm, lsf-pc, linux-block

On Thu, Jan 18, 2018 at 5:21 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Jan 17, 2018 at 10:49:24AM +0800, Ming Lei wrote:
>> Userfaultfd might be another choice:
>>
>> 1) map the block LBA space into a range of process vm space
>
> That would limit the size of a block device to ~200TB (with my laptop's
> CPU).  That's probably OK for most users, but I suspect there are some
> who would chafe at such a restriction (before the 57-bit CPUs arrive).

In theory it won't be an issue, since the LBA space can be partitioned
across more than one process's VM space, so this approach should work no
matter how big the block device is.

>
>> 2) when READ/WRITE req comes, convert it to page fault on the
>> mapped range, and let userland to take control of it, and meantime
>> kernel req context is slept
>
> You don't want to sleep the request; you want it to be able to submit
> more I/O.  But we have infrastructure in place to inform the submitter
> when I/Os have completed.

Yes, the current bio completion (.bi_end_io) model can be respected, and
this issue (where to sleep) may depend on UFFD's read/POLLIN protocol.

>
>> 3) IO req context in kernel side is waken up after userspace completed
>> the IO request via userfaultfd
>>
>> 4) kernel side continue to complete the IO, such as copying page from
>> storage range to req(bio) pages.
>>
>> Seems READ should be fine since it is very similar with the use case
>> of QEMU postcopy live migration, WRITE can be a bit different, and
>> maybe need some change on userfaultfd.
>
> I like this idea, and maybe extending UFFD is the way to solve this
> problem.  Perhaps I should explain a little more what the requirements
> are.  At the point the driver gets the I/O, pages to copy data into (for
> a read) or copy data from (for a write) have already been allocated.
> At all costs, we need to avoid playing VM tricks (because TLB flushes
> are expensive).  So one copy is probably OK, but we'd like to avoid it
> if reasonable.

I agree, and a single page copy can be easier to implement.

>
> Let's assume that the userspace program looks at the request metadata and
> decides that it needs to send a network request.  Ideally, it would find
> a way to have the data from the response land in the pre-allocated pages
> (for a read) or send the data straight from the pages in the request
> (for a write).  I'm not sure UFFD helps us with that part of the problem.


-- 
Ming Lei

^ permalink raw reply	[flat|nested] 23+ messages in thread

Thread overview:
2018-01-16 14:52 [LSF/MM TOPIC] A high-performance userspace block driver Matthew Wilcox
2018-01-16 23:04 ` Viacheslav Dubeyko
2018-01-16 23:23 ` Theodore Ts'o
2018-01-16 23:28   ` James Bottomley
2018-01-16 23:57     ` Bart Van Assche
2018-01-17  0:41 ` Bart Van Assche
2018-01-17  2:49 ` Ming Lei
2018-01-17 21:21   ` Matthew Wilcox
2018-01-22 12:02     ` Mike Rapoport
2018-01-22 12:18     ` Ming Lei
2018-01-18  5:27 ` Figo.zhang