All of lore.kernel.org
 help / color / mirror / Atom feed
* Persistent memory interface
@ 2015-06-19 16:33 Mikulas Patocka
  2015-06-19 16:48 ` Matthew Wilcox
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Mikulas Patocka @ 2015-06-19 16:33 UTC (permalink / raw)
  To: Ross Zwisler, Christoph Hellwig, linux-nvdimm
  Cc: Boaz Harrosh, Keith Busch, Andy Lutomirski, Jens Axboe, dm-devel,
	Borislav Petkov, H. Peter Anvin, Matthew Wilcox, Andrew Morton,
	Thomas Gleixner

Hi

I looked at the new the persistent memory block device driver 
(drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the 
interface between them is incorrect.

If I want to use persistent memory in another driver, for a different 
purpose, how can I make sure that that drivers/block/pmem.c doesn't attach 
to this piece of memory and export it? It seems not possible. 
drivers/block/pmem.c attaches to everything without regard that there may 
be other users of persistent memory.

I think a correct solution would be to add a partition table at the 
beginning of persistent memory area and this partition table would 
describe which parts belong to which programs - so that different programs 
could use persistent memory and not step over each other's data. Is there 
some effort to standardize the partition table ongoing?


BTW. some journaling filesystems assume that 512-byte sector is written 
atomically. drivers/block/pmem.c breaks this requirement. Persistent 
memory only gurantees 8-byte atomic writes.

Mikulas

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 16:33 Persistent memory interface Mikulas Patocka
@ 2015-06-19 16:48 ` Matthew Wilcox
  2015-06-19 16:52   ` Mikulas Patocka
  2015-06-19 16:50 ` Verma, Vishal L
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2015-06-19 16:48 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Boaz Harrosh, Thomas Gleixner, Andrew Morton, linux-nvdimm,
	Keith Busch, Andy Lutomirski, Jens Axboe, dm-devel,
	Borislav Petkov, H. Peter Anvin, Ross Zwisler, Christoph Hellwig

On Fri, Jun 19, 2015 at 12:33:09PM -0400, Mikulas Patocka wrote:
> If I want to use persistent memory in another driver, for a different 
> purpose, how can I make sure that that drivers/block/pmem.c doesn't attach 
> to this piece of memory and export it? It seems not possible. 
> drivers/block/pmem.c attaches to everything without regard that there may 
> be other users of persistent memory.

You need to look at the patches that Dan has been posting to add support
for Labels (and NFIT and ...).  Those weren't available earlier because
the label spec wasn't public.

> BTW. some journaling filesystems assume that 512-byte sector is written 
> atomically. drivers/block/pmem.c breaks this requirement. Persistent 
> memory only gurantees 8-byte atomic writes.

... and that's addressed by the BTT patches.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 16:33 Persistent memory interface Mikulas Patocka
  2015-06-19 16:48 ` Matthew Wilcox
@ 2015-06-19 16:50 ` Verma, Vishal L
  2015-06-22 16:50   ` Doug Dumitru
  2015-06-19 16:53 ` Doug Dumitru
  2015-06-19 17:35 ` Dan Williams
  3 siblings, 1 reply; 15+ messages in thread
From: Verma, Vishal L @ 2015-06-19 16:50 UTC (permalink / raw)
  To: mpatocka
  Cc: Busch, Keith, akpm, linux-nvdimm, luto, axboe, dm-devel, bp, hpa,
	tglx, ross.zwisler, hch

On Fri, 2015-06-19 at 12:33 -0400, Mikulas Patocka wrote:
> Hi
> 
> I looked at the new the persistent memory block device driver 
> (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the 
> interface between them is incorrect.
> 
> If I want to use persistent memory in another driver, for a different 
> purpose, how can I make sure that that drivers/block/pmem.c doesn't attach 
> to this piece of memory and export it? It seems not possible. 
> drivers/block/pmem.c attaches to everything without regard that there may 
> be other users of persistent memory.
> 
> I think a correct solution would be to add a partition table at the 
> beginning of persistent memory area and this partition table would 
> describe which parts belong to which programs - so that different programs 
> could use persistent memory and not step over each other's data. Is there 
> some effort to standardize the partition table ongoing?
> 
> 
> BTW. some journaling filesystems assume that 512-byte sector is written 
> atomically. drivers/block/pmem.c breaks this requirement. Persistent 
> memory only gurantees 8-byte atomic writes.

Hi Mikulas,

I can answer this part - The idea is that file systems that need sector
atomicity will use the "Block Translation Table" (BTT) [1]. It would be
a stacked block device on top of a pmem device (or partition), and file
systems would be able to use it either for the entire space to get
atomicity for all blocks, or if they want to use DAX, make two
partitions, and enable the BTT only on one partition, and use it as the
logdev.

	-Vishal

[1]: https://lkml.org/lkml/2015/6/17/950

> 
> Mikulas
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 16:48 ` Matthew Wilcox
@ 2015-06-19 16:52   ` Mikulas Patocka
  0 siblings, 0 replies; 15+ messages in thread
From: Mikulas Patocka @ 2015-06-19 16:52 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-nvdimm, dm-devel



On Fri, 19 Jun 2015, Matthew Wilcox wrote:

> On Fri, Jun 19, 2015 at 12:33:09PM -0400, Mikulas Patocka wrote:
> > If I want to use persistent memory in another driver, for a different 
> > purpose, how can I make sure that that drivers/block/pmem.c doesn't attach 
> > to this piece of memory and export it? It seems not possible. 
> > drivers/block/pmem.c attaches to everything without regard that there may 
> > be other users of persistent memory.
> 
> You need to look at the patches that Dan has been posting to add support
> for Labels (and NFIT and ...).  Those weren't available earlier because
> the label spec wasn't public.

Please, give me a link to those patches.

Mikulas

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 16:33 Persistent memory interface Mikulas Patocka
  2015-06-19 16:48 ` Matthew Wilcox
  2015-06-19 16:50 ` Verma, Vishal L
@ 2015-06-19 16:53 ` Doug Dumitru
  2015-06-19 17:35 ` Dan Williams
  3 siblings, 0 replies; 15+ messages in thread
From: Doug Dumitru @ 2015-06-19 16:53 UTC (permalink / raw)
  To: device-mapper development
  Cc: Boaz Harrosh, Thomas Gleixner, Andrew Morton, Jens Axboe,
	linux-nvdimm, Andy Lutomirski, Keith Busch, Borislav Petkov,
	H. Peter Anvin, Matthew Wilcox, Ross Zwisler, Christoph Hellwig


[-- Attachment #1.1: Type: text/plain, Size: 1501 bytes --]

Mikulas,

Can you send me a link to this driver.  I need to implement a "real" and
"correct" driver for a PMC Sierra card, so I need to pay attention to the
atomic issue you described.

Doug Dumitru
EasyCo LLC


On Fri, Jun 19, 2015 at 9:33 AM, Mikulas Patocka <mpatocka@redhat.com>
wrote:

> Hi
>
> I looked at the new the persistent memory block device driver
> (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
> interface between them is incorrect.
>
> If I want to use persistent memory in another driver, for a different
> purpose, how can I make sure that that drivers/block/pmem.c doesn't attach
> to this piece of memory and export it? It seems not possible.
> drivers/block/pmem.c attaches to everything without regard that there may
> be other users of persistent memory.
>
> I think a correct solution would be to add a partition table at the
> beginning of persistent memory area and this partition table would
> describe which parts belong to which programs - so that different programs
> could use persistent memory and not step over each other's data. Is there
> some effort to standardize the partition table ongoing?
>
>
> BTW. some journaling filesystems assume that 512-byte sector is written
> atomically. drivers/block/pmem.c breaks this requirement. Persistent
> memory only gurantees 8-byte atomic writes.
>
> Mikulas
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>



-- 
Doug Dumitru
EasyCo LLC

[-- Attachment #1.2: Type: text/html, Size: 2522 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 16:33 Persistent memory interface Mikulas Patocka
                   ` (2 preceding siblings ...)
  2015-06-19 16:53 ` Doug Dumitru
@ 2015-06-19 17:35 ` Dan Williams
  2015-06-19 19:14   ` Mikulas Patocka
  3 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2015-06-19 17:35 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Thomas Gleixner, Andrew Morton, linux-nvdimm,
	Andy Lutomirski, Keith Busch, dm-devel, Borislav Petkov,
	H. Peter Anvin, Ross Zwisler, Christoph Hellwig

On Fri, Jun 19, 2015 at 9:33 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> Hi
>
> I looked at the new the persistent memory block device driver
> (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
> interface between them is incorrect.
>
> If I want to use persistent memory in another driver, for a different
> purpose, how can I make sure that that drivers/block/pmem.c doesn't attach
> to this piece of memory and export it? It seems not possible.
> drivers/block/pmem.c attaches to everything without regard that there may
> be other users of persistent memory.

Simply partition the pmem device however you see fit and then
blkdev_get_by_path(<dev>, FMODE_EXCL, <holder>) the resulting
partition(s).

See the BTT infrastructure for how to autodetect and re-assemble this
allocation at sub-system init time:

https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/tree/drivers/nvdimm/btt_devs.c?h=libnvdimm-pending#n450

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 17:35 ` Dan Williams
@ 2015-06-19 19:14   ` Mikulas Patocka
  2015-06-19 19:22     ` Dan Williams
  0 siblings, 1 reply; 15+ messages in thread
From: Mikulas Patocka @ 2015-06-19 19:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Thomas Gleixner, Andrew Morton, linux-nvdimm,
	Andy Lutomirski, Keith Busch, dm-devel, Borislav Petkov,
	H. Peter Anvin, Ross Zwisler, Christoph Hellwig



On Fri, 19 Jun 2015, Dan Williams wrote:

> On Fri, Jun 19, 2015 at 9:33 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > Hi
> >
> > I looked at the new the persistent memory block device driver
> > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
> > interface between them is incorrect.
> >
> > If I want to use persistent memory in another driver, for a different
> > purpose, how can I make sure that that drivers/block/pmem.c doesn't attach
> > to this piece of memory and export it? It seems not possible.
> > drivers/block/pmem.c attaches to everything without regard that there may
> > be other users of persistent memory.
> 
> Simply partition the pmem device however you see fit and then
> blkdev_get_by_path(<dev>, FMODE_EXCL, <holder>) the resulting
> partition(s).

But that still means accessing it through the block layer.

I need to access persistent memory directly - map it into the kernel space 
and access it as mapped memory - and if I do it, it will fight with 
drivers/block/pmem.c over ownership of the memory :-(

Mikulas

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 19:14   ` Mikulas Patocka
@ 2015-06-19 19:22     ` Dan Williams
  2015-06-19 19:23       ` Dan Williams
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2015-06-19 19:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Thomas Gleixner, Andrew Morton, linux-nvdimm,
	Andy Lutomirski, Keith Busch, dm-devel, Borislav Petkov,
	H. Peter Anvin, Ross Zwisler, Christoph Hellwig

On Fri, Jun 19, 2015 at 12:14 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Fri, 19 Jun 2015, Dan Williams wrote:
>
>> On Fri, Jun 19, 2015 at 9:33 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > Hi
>> >
>> > I looked at the new the persistent memory block device driver
>> > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
>> > interface between them is incorrect.
>> >
>> > If I want to use persistent memory in another driver, for a different
>> > purpose, how can I make sure that that drivers/block/pmem.c doesn't attach
>> > to this piece of memory and export it? It seems not possible.
>> > drivers/block/pmem.c attaches to everything without regard that there may
>> > be other users of persistent memory.
>>
>> Simply partition the pmem device however you see fit and then
>> blkdev_get_by_path(<dev>, FMODE_EXCL, <holder>) the resulting
>> partition(s).
>
> But that still means accessing it through the block layer.
>
> I need to access persistent memory directly - map it into the kernel space
> and access it as mapped memory - and if I do it, it will fight with
> drivers/block/pmem.c over ownership of the memory :-(
>

See the latest definition of struct block_device_operations in these
patches.  We now have ->direct_access() in the 4.0 kernel and will
have ->rw_bytes() in the 4.2 kernel if you want to directly access the
memory published by a pmem device.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 19:22     ` Dan Williams
@ 2015-06-19 19:23       ` Dan Williams
  0 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2015-06-19 19:23 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Thomas Gleixner, Andrew Morton, linux-nvdimm,
	Andy Lutomirski, Keith Busch, dm-devel, Borislav Petkov,
	H. Peter Anvin, Ross Zwisler, Christoph Hellwig

On Fri, Jun 19, 2015 at 12:22 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Jun 19, 2015 at 12:14 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>
>> On Fri, 19 Jun 2015, Dan Williams wrote:
>>
>>> On Fri, Jun 19, 2015 at 9:33 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> > Hi
>>> >
>>> > I looked at the new the persistent memory block device driver
>>> > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
>>> > interface between them is incorrect.
>>> >
>>> > If I want to use persistent memory in another driver, for a different
>>> > purpose, how can I make sure that that drivers/block/pmem.c doesn't attach
>>> > to this piece of memory and export it? It seems not possible.
>>> > drivers/block/pmem.c attaches to everything without regard that there may
>>> > be other users of persistent memory.
>>>
>>> Simply partition the pmem device however you see fit and then
>>> blkdev_get_by_path(<dev>, FMODE_EXCL, <holder>) the resulting
>>> partition(s).
>>
>> But that still means accessing it through the block layer.
>>
>> I need to access persistent memory directly - map it into the kernel space
>> and access it as mapped memory - and if I do it, it will fight with
>> drivers/block/pmem.c over ownership of the memory :-(
>>
>
> See the latest definition of struct block_device_operations in these
> patches.  We now have ->direct_access() in the 4.0 kernel and will
> have ->rw_bytes() in the 4.2 kernel if you want to directly access the
> memory published by a pmem device.

I'll include you on the next rev of patches that add a
kmap_atomic_pfn_t(): https://lkml.org/lkml/2015/6/5/802

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-19 16:50 ` Verma, Vishal L
@ 2015-06-22 16:50   ` Doug Dumitru
  2015-06-22 19:01     ` Doug Dumitru
  2015-06-22 21:25     ` Verma, Vishal L
  0 siblings, 2 replies; 15+ messages in thread
From: Doug Dumitru @ 2015-06-22 16:50 UTC (permalink / raw)
  To: device-mapper development
  Cc: axboe, linux-nvdimm, hch, luto, Busch, Keith, mpatocka, hpa, bp,
	akpm, ross.zwisler, tglx


[-- Attachment #1.1: Type: text/plain, Size: 4226 bytes --]

I would like to comment on the BTT docs a bit.  There are some design
points you might want to consider.

First, real use cases will have no read/write collisions.  If you think of
a file system, the case of reading a block that is being written or writing
a single block twice just don't happen because the data itself is non
deterministic.  The driver still needs to handle these cases, but
optimizing it for this is not all that logical.

Lets start at the BTT table.  It would be useful if we could distinguish
between a stable block and one that is getting updated now.  An option is
to encode the TRIM/ERROR bits as four "states"  (stable, updating, trimmed,
error) and just use the BTT entry as an index.  This could probably point
directly to the FLOG entry.  The BTT table, at four bytes, has atomic
updates without locks, so two threads can simultaneously update it to point
to the FLOG table and then after the update, see if they won.  If they did
not win, they can wait for the first update to complete.  The FLOG table
could also have a parallel RAM based BTT2 table to store spinlocks or
linked-lists to handle collisions.  Then again, a simple spin or spin/sleep
is probably good enough.

The same works for readers.  If you read a block, check the BTT table after
you finish the read.  If it is the same, your read was good.  If it changed
underneath you, or is pointing to a FLOG block, then you need to wait or
re-read.  Again, the real-world frequency of collisions is very low.  This
would let you eliminate the RTT table entirely.

One final optimization would be to keep the BTT table both in standard RAM
as well as in NV RAM.  If standard RAM is faster, then reads could lookup
blocks without touching the NV driver.  For 512G, this is 1B blocks or 4G
of RAM.  Then again, if the NV RAM is just as fast, this would not help.
Perhaps an option.

I have gotten into a lot of trouble optimizing for fio collisions when
these collisions don't really impact real-workload performance.  The code
has to be "correct" in the collision case, but it does not really need to
be fast.

Doug Dumitru
EasyCo LLC



On Fri, Jun 19, 2015 at 9:50 AM, Verma, Vishal L <vishal.l.verma@intel.com>
wrote:

> On Fri, 2015-06-19 at 12:33 -0400, Mikulas Patocka wrote:
> > Hi
> >
> > I looked at the new the persistent memory block device driver
> > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
> > interface between them is incorrect.
> >
> > If I want to use persistent memory in another driver, for a different
> > purpose, how can I make sure that that drivers/block/pmem.c doesn't
> attach
> > to this piece of memory and export it? It seems not possible.
> > drivers/block/pmem.c attaches to everything without regard that there may
> > be other users of persistent memory.
> >
> > I think a correct solution would be to add a partition table at the
> > beginning of persistent memory area and this partition table would
> > describe which parts belong to which programs - so that different
> programs
> > could use persistent memory and not step over each other's data. Is there
> > some effort to standardize the partition table ongoing?
> >
> >
> > BTW. some journaling filesystems assume that 512-byte sector is written
> > atomically. drivers/block/pmem.c breaks this requirement. Persistent
> > memory only gurantees 8-byte atomic writes.
>
> Hi Mikulas,
>
> I can answer this part - The idea is that file systems that need sector
> atomicity will use the "Block Translation Table" (BTT) [1]. It would be
> a stacked block device on top of a pmem device (or partition), and file
> systems would be able to use it either for the entire space to get
> atomicity for all blocks, or if they want to use DAX, make two
> partitions, and enable the BTT only on one partition, and use it as the
> logdev.
>
>         -Vishal
>
> [1]: https://lkml.org/lkml/2015/6/17/950
>
> >
> > Mikulas
> > _______________________________________________
> > Linux-nvdimm mailing list
> > Linux-nvdimm@lists.01.org
> > https://lists.01.org/mailman/listinfo/linux-nvdimm
>
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>



-- 
Doug Dumitru
EasyCo LLC

[-- Attachment #1.2: Type: text/html, Size: 6365 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-22 16:50   ` Doug Dumitru
@ 2015-06-22 19:01     ` Doug Dumitru
  2015-06-22 21:35       ` Verma, Vishal L
  2015-06-22 21:25     ` Verma, Vishal L
  1 sibling, 1 reply; 15+ messages in thread
From: Doug Dumitru @ 2015-06-22 19:01 UTC (permalink / raw)
  To: device-mapper development
  Cc: axboe, linux-nvdimm, hch, luto, Busch, Keith, mpatocka, hpa, bp,
	akpm, ross.zwisler, tglx


[-- Attachment #1.1: Type: text/plain, Size: 4570 bytes --]

... while you are at it, you should consider supporting other binary block
sizes.

Doug

On Mon, Jun 22, 2015 at 9:50 AM, Doug Dumitru <doug@easyco.com> wrote:

> I would like to comment on the BTT docs a bit.  There are some design
> points you might want to consider.
>
> First, real use cases will have no read/write collisions.  If you think of
> a file system, the case of reading a block that is being written or writing
> a single block twice just don't happen because the data itself is non
> deterministic.  The driver still needs to handle these cases, but
> optimizing it for this is not all that logical.
>
> Lets start at the BTT table.  It would be useful if we could distinguish
> between a stable block and one that is getting updated now.  An option is
> to encode the TRIM/ERROR bits as four "states"  (stable, updating, trimmed,
> error) and just use the BTT entry as an index.  This could probably point
> directly to the FLOG entry.  The BTT table, at four bytes, has atomic
> updates without locks, so two threads can simultaneously update it to point
> to the FLOG table and then after the update, see if they won.  If they did
> not win, they can wait for the first update to complete.  The FLOG table
> could also have a parallel RAM based BTT2 table to store spinlocks or
> linked-lists to handle collisions.  Then again, a simple spin or spin/sleep
> is probably good enough.
>
> The same works for readers.  If you read a block, check the BTT table
> after you finish the read.  If it is the same, your read was good.  If it
> changed underneath you, or is pointing to a FLOG block, then you need to
> wait or re-read.  Again, the real-world frequency of collisions is very
> low.  This would let you eliminate the RTT table entirely.
>
> One final optimization would be to keep the BTT table both in standard RAM
> as well as in NV RAM.  If standard RAM is faster, then reads could lookup
> blocks without touching the NV driver.  For 512G, this is 1B blocks or 4G
> of RAM.  Then again, if the NV RAM is just as fast, this would not help.
> Perhaps an option.
>
> I have gotten into a lot of trouble optimizing for fio collisions when
> these collisions don't really impact real-workload performance.  The code
> has to be "correct" in the collision case, but it does not really need to
> be fast.
>
> Doug Dumitru
> EasyCo LLC
>
>
>
> On Fri, Jun 19, 2015 at 9:50 AM, Verma, Vishal L <vishal.l.verma@intel.com
> > wrote:
>
>> On Fri, 2015-06-19 at 12:33 -0400, Mikulas Patocka wrote:
>> > Hi
>> >
>> > I looked at the new the persistent memory block device driver
>> > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
>> > interface between them is incorrect.
>> >
>> > If I want to use persistent memory in another driver, for a different
>> > purpose, how can I make sure that that drivers/block/pmem.c doesn't
>> attach
>> > to this piece of memory and export it? It seems not possible.
>> > drivers/block/pmem.c attaches to everything without regard that there
>> may
>> > be other users of persistent memory.
>> >
>> > I think a correct solution would be to add a partition table at the
>> > beginning of persistent memory area and this partition table would
>> > describe which parts belong to which programs - so that different
>> programs
>> > could use persistent memory and not step over each other's data. Is
>> there
>> > some effort to standardize the partition table ongoing?
>> >
>> >
>> > BTW. some journaling filesystems assume that 512-byte sector is written
>> > atomically. drivers/block/pmem.c breaks this requirement. Persistent
>> > memory only gurantees 8-byte atomic writes.
>>
>> Hi Mikulas,
>>
>> I can answer this part - The idea is that file systems that need sector
>> atomicity will use the "Block Translation Table" (BTT) [1]. It would be
>> a stacked block device on top of a pmem device (or partition), and file
>> systems would be able to use it either for the entire space to get
>> atomicity for all blocks, or if they want to use DAX, make two
>> partitions, and enable the BTT only on one partition, and use it as the
>> logdev.
>>
>>         -Vishal
>>
>> [1]: https://lkml.org/lkml/2015/6/17/950
>>
>> >
>> > Mikulas
>> > _______________________________________________
>> > Linux-nvdimm mailing list
>> > Linux-nvdimm@lists.01.org
>> > https://lists.01.org/mailman/listinfo/linux-nvdimm
>>
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>
>
>
> --
> Doug Dumitru
> EasyCo LLC
>



-- 
Doug Dumitru
EasyCo LLC

[-- Attachment #1.2: Type: text/html, Size: 7237 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-22 16:50   ` Doug Dumitru
  2015-06-22 19:01     ` Doug Dumitru
@ 2015-06-22 21:25     ` Verma, Vishal L
  2015-06-22 22:17       ` Doug Dumitru
  1 sibling, 1 reply; 15+ messages in thread
From: Verma, Vishal L @ 2015-06-22 21:25 UTC (permalink / raw)
  To: doug
  Cc: Busch, Keith, akpm, linux-nvdimm, luto, axboe, dm-devel,
	mpatocka, hpa, bp, tglx, hch

On Mon, 2015-06-22 at 09:50 -0700, Doug Dumitru wrote:
> I would like to comment on the BTT docs a bit.  There are some design
> points you might want to consider.
> 
Hi Doug,
Thanks for the design review -
> 
> First, real use cases will have no read/write collisions.  If you
> think of a file system, the case of reading a block that is being
> written or writing a single block twice just don't happen because the
> data itself is non deterministic.  The driver still needs to handle
> these cases, but optimizing it for this is not all that logical.
> 
Agreed - we did try to keep this in mind, and I still have a list of
experiments to try that might further reduce the impact of collision
handling on normal IO.
> 
> Lets start at the BTT table.  It would be useful if we could
> distinguish between a stable block and one that is getting updated
> now.  An option is to encode the TRIM/ERROR bits as four "states"
> (stable, updating, trimmed, error) and just use the BTT entry as an
> index.  This could probably point directly to the FLOG entry.  The BTT
> table, at four bytes, has atomic updates without locks, so two threads
> can simultaneously update it to point to the FLOG table and then after
> the update, see if they won.  If they did not win, they can wait for
> the first update to complete.  The FLOG table could also have a
> parallel RAM based BTT2 table to store spinlocks or linked-lists to
> handle collisions.  Then again, a simple spin or spin/sleep is
> probably good enough.

So first off, this made me realize the documentation was slightly dated
w.r.t. the use of the 'E' and 'Z' flags - we currently (and the code
reflects this) do use all four states from those two bits. (The next
posting of this patchset should reflect this change).

Bit      Description
31 - 30	: Error and Zero flags - Used in the following way:
	 Bit		      Description
	31 30
	-----------------------------------------------------------------------
	 00	Initial state. Reads return zeroes; Premap = Postmap
	 01	Zero state: Reads return zeroes
	 10	Error state: Reads fail; Writes clear 'E' bit
	 11	Normal Block – has valid postmap

29 - 0	: Mappings to internal 'postmap' blocks

Apart from this, I didn't quite understand what you meant by:
> ... and just use the BTT entry as an index.  This could probably point
directly to the FLOG entry.

Care to elaborate with an example maybe?
> 
> 
> The same works for readers.  If you read a block, check the BTT table
> after you finish the read.  If it is the same, your read was good.  If
> it changed underneath you, or is pointing to a FLOG block, then you
> need to wait or re-read.  Again, the real-world frequency of
> collisions is very low.  This would let you eliminate the RTT table
> entirely.

There are a couple of issues I see with this - 
1. This could, theoretically, suffer from the A-B-A problem - a writer
could update the same mapping twice, ending up with an apparently
untouched map entry, but the data has gone through churn while the
reader was still reading it.
2. Writing/reading the RTT is cheap as it is entirely in DRAM memory,
whereas another check on the map would mean NVDIMM IO, which could
potentially be slow enough to offset any benefit from getting rid of the
RTT.

One of the pending tests I plan to do is to make the RTT a hash map
instead of a list. i.e. Reader adds a refcount to rtt[postmap ABA], and
writer checks just RTT[free-block]. This keeps things the same for the
reader, but for writers, they are sapred from walking a list, at the
cost of false hash collisions. Obviously, with this path we'd have to
make the RTT a lot bigger than it currently is - and we'd need to figure
out if the increased memory footprint, and hash collisions outweigh the
cost of walking an in-memory list..
> 
> 
> One final optimization would be to keep the BTT table both in standard
> RAM as well as in NV RAM.  If standard RAM is faster, then reads could
> lookup blocks without touching the NV driver.  For 512G, this is 1B
> blocks or 4G of RAM.  Then again, if the NV RAM is just as fast, this
> would not help.  Perhaps an option.

We've thought about this, and while it could help, the added complexity
of keeping the two in sync may make it debatable - we'll have to
re-evaluate once NVDIMMs become more easily available :)
> 
> 
> I have gotten into a lot of trouble optimizing for fio collisions when
> these collisions don't really impact real-workload performance.  The
> code has to be "correct" in the collision case, but it does not really
> need to be fast.
> 
> 
> 
> Doug Dumitru
> 
> EasyCo LLC
> 
> 
> 
> 
> 
> On Fri, Jun 19, 2015 at 9:50 AM, Verma, Vishal L
> <vishal.l.verma@intel.com> wrote:
>         On Fri, 2015-06-19 at 12:33 -0400, Mikulas Patocka wrote:
>         > Hi
>         >
>         > I looked at the new the persistent memory block device
>         driver
>         > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it
>         seems that the
>         > interface between them is incorrect.
>         >
>         > If I want to use persistent memory in another driver, for a
>         different
>         > purpose, how can I make sure that that drivers/block/pmem.c
>         doesn't attach
>         > to this piece of memory and export it? It seems not
>         possible.
>         > drivers/block/pmem.c attaches to everything without regard
>         that there may
>         > be other users of persistent memory.
>         >
>         > I think a correct solution would be to add a partition table
>         at the
>         > beginning of persistent memory area and this partition table
>         would
>         > describe which parts belong to which programs - so that
>         different programs
>         > could use persistent memory and not step over each other's
>         data. Is there
>         > some effort to standardize the partition table ongoing?
>         >
>         >
>         > BTW. some journaling filesystems assume that 512-byte sector
>         is written
>         > atomically. drivers/block/pmem.c breaks this requirement.
>         Persistent
>         > memory only gurantees 8-byte atomic writes.
>         
>         
>         Hi Mikulas,
>         
>         I can answer this part - The idea is that file systems that
>         need sector
>         atomicity will use the "Block Translation Table" (BTT) [1]. It
>         would be
>         a stacked block device on top of a pmem device (or partition),
>         and file
>         systems would be able to use it either for the entire space to
>         get
>         atomicity for all blocks, or if they want to use DAX, make two
>         partitions, and enable the BTT only on one partition, and use
>         it as the
>         logdev.
>         
>                 -Vishal
>         
>         [1]: https://lkml.org/lkml/2015/6/17/950
>         
>         >
>         > Mikulas
>         > _______________________________________________
>         > Linux-nvdimm mailing list
>         > Linux-nvdimm@lists.01.org
>         > https://lists.01.org/mailman/listinfo/linux-nvdimm
>         
>         
>         --
>         dm-devel mailing list
>         dm-devel@redhat.com
>         https://www.redhat.com/mailman/listinfo/dm-devel
>         
> 
> 
> 
> -- 
> Doug Dumitru
> EasyCo LLC
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-22 19:01     ` Doug Dumitru
@ 2015-06-22 21:35       ` Verma, Vishal L
  2015-06-23 10:21         ` hch
  0 siblings, 1 reply; 15+ messages in thread
From: Verma, Vishal L @ 2015-06-22 21:35 UTC (permalink / raw)
  To: doug
  Cc: Busch, Keith, akpm, linux-nvdimm, luto, axboe, dm-devel,
	mpatocka, hpa, bp, tglx, hch

On Mon, 2015-06-22 at 12:01 -0700, Doug Dumitru wrote:
> ... while you are at it, you should consider supporting other binary
> block sizes.
> 
Hm, what would be the use case? We do support a variety of metadata
sizes for protection information [1]. Apart from this, the standard
block sizes of 512 and 4096 are the only ones Linux can support, so I'm
not sure what you mean by other binary block sizes?

	-Vishal

[1]: https://lkml.org/lkml/2015/6/17/947

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-22 21:25     ` Verma, Vishal L
@ 2015-06-22 22:17       ` Doug Dumitru
  0 siblings, 0 replies; 15+ messages in thread
From: Doug Dumitru @ 2015-06-22 22:17 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: Busch, Keith, akpm, linux-nvdimm, luto, axboe, dm-devel,
	mpatocka, hpa, bp, tglx, hch


[-- Attachment #1.1: Type: text/plain, Size: 11060 bytes --]

Mr. Verma,

I am used  to doing all of this strictly in DRAM, so my optimization
background has differing overhead points.

On Mon, Jun 22, 2015 at 2:25 PM, Verma, Vishal L <vishal.l.verma@intel.com>
wrote:

> On Mon, 2015-06-22 at 09:50 -0700, Doug Dumitru wrote:
> > I would like to comment on the BTT docs a bit.  There are some design
> > points you might want to consider.
> >
> Hi Doug,
> Thanks for the design review -
> >
> > First, real use cases will have no read/write collisions.  If you
> > think of a file system, the case of reading a block that is being
> > written or writing a single block twice just don't happen because the
> > data itself is non deterministic.  The driver still needs to handle
> > these cases, but optimizing it for this is not all that logical.
> >
> Agreed - we did try to keep this in mind, and I still have a list of
> experiments to try that might further reduce the impact of collision
> handling on normal IO.
> >
> > Lets start at the BTT table.  It would be useful if we could
> > distinguish between a stable block and one that is getting updated
> > now.  An option is to encode the TRIM/ERROR bits as four "states"
> > (stable, updating, trimmed, error) and just use the BTT entry as an
> > index.  This could probably point directly to the FLOG entry.  The BTT
> > table, at four bytes, has atomic updates without locks, so two threads
> > can simultaneously update it to point to the FLOG table and then after
> > the update, see if they won.  If they did not win, they can wait for
> > the first update to complete.  The FLOG table could also have a
> > parallel RAM based BTT2 table to store spinlocks or linked-lists to
> > handle collisions.  Then again, a simple spin or spin/sleep is
> > probably good enough.
>
> So first off, this made me realize the documentation was slightly dated
> w.r.t. the use of the 'E' and 'Z' flags - we currently (and the code
> reflects this) do use all four states from those two bits. (The next
> posting of this patchset should reflect this change).
>
> Bit      Description
> 31 - 30 : Error and Zero flags - Used in the following way:
>          Bit                  Description
>         31 30
>
> -----------------------------------------------------------------------
>          00     Initial state. Reads return zeroes; Premap = Postmap
>          01     Zero state: Reads return zeroes
>          10     Error state: Reads fail; Writes clear 'E' bit
>          11     Normal Block – has valid postmap
>
> 29 - 0  : Mappings to internal 'postmap' blocks
>
> Apart from this, I didn't quite understand what you meant by:
> > ... and just use the BTT entry as an index.  This could probably point
> directly to the FLOG entry.
>
> Care to elaborate with an example maybe?
>

​I am not sure how "committed" to binary region sizes here, but the first
thing that sticks out is that all but the normal state don't really need an
index entry.  If you are willing to suffer some "non binary divide"
operations, you can increase the efficiency of your region by using these 4
bytes as:

0 - Initial state
1 - Zero/trim/discard block, but why not just '0' above
2 - Error
3-F - Future flags
10-FFFF - index into FLOG table
10000-FFFFFFFF - index directly into RAM store

​This pushes the region to just under 4B entries (2TB with 512 and 16TB
with 4K).  If you must stay binary, consider a single bit.​

0 - Initial state
1 - Zero
2 - Error
10-FFFF - index into FLOW table
8FFFFFFFF - FFFFFFFF - index directly into RAM store

​The FLOG table can have pre/post RAM store index (per your current
design), so a read on a write has to do a double lookup, but that might be
good anyway, and the double lookup is only on a collision anyway.

In terms of read overlaps, you could argue you can do this without a lock.
If you read a BTT entry and it is 1000-FFFFFFFF, and you deterministically
control how often a block gets re-used, it is safe to just read even if an
update starts half-way through.  If you want to be paranoid, setup an
atomic for updates that the read can use to protect itself.

read update ATOMIC
read BTT
if BTT is not an index into the FLOG table​ then read data
read update ATOMIC again
If ATOMIC has not moved more than the guaranteed blocks over commit number,
you know the read is valid data and the block has not been overwritten.

The write logic just increments the ATOMIC (and lets it wrap around)
whenever a new block is used.  As long as the avail block first-in
first-out list is slightly bigger than the advertised device capacity, you
can guarantee "N" updates before a read gets the rug pulled out from under
it.

>
> >
> > The same works for readers.  If you read a block, check the BTT table
> > after you finish the read.  If it is the same, your read was good.  If
> > it changed underneath you, or is pointing to a FLOG block, then you
> > need to wait or re-read.  Again, the real-world frequency of
> > collisions is very low.  This would let you eliminate the RTT table
> > entirely.
>
> There are a couple of issues I see with this -
> 1. This could, theoretically, suffer from the A-B-A problem - a writer
> could update the same mapping twice, ending up with an apparently
> untouched map entry, but the data has gone through churn while the
> reader was still reading it.
> 2. Writing/reading the RTT is cheap as it is entirely in DRAM memory,
> whereas another check on the map would mean NVDIMM IO, which could
> potentially be slow enough to offset any benefit from getting rid of the
> RTT.
>

​If the API read is expensive, then you are probably right.  Consider an
ATOMIC DRAM counter as described above.  This requires you to keep a small
number of blocks "spare" so that reads complete before they can get
clobbered, but this lets reads run lock free.​


>
> One of the pending tests I plan to do is to make the RTT a hash map
> instead of a list. i.e. Reader adds a refcount to rtt[postmap ABA], and
> writer checks just RTT[free-block]. This keeps things the same for the
> reader, but for writers, they are sapred from walking a list, at the
> cost of false hash collisions. Obviously, with this path we'd have to
> make the RTT a lot bigger than it currently is - and we'd need to figure
> out if the increased memory footprint, and hash collisions outweigh the
> cost of walking an in-memory list..
>

​If the BTT is a pointer to the FLOG, I think the RTT can go away entirely.​


> >
> >
> > One final optimization would be to keep the BTT table both in standard
> > RAM as well as in NV RAM.  If standard RAM is faster, then reads could
> > lookup blocks without touching the NV driver.  For 512G, this is 1B
> > blocks or 4G of RAM.  Then again, if the NV RAM is just as fast, this
> > would not help.  Perhaps an option.
>
> We've thought about this, and while it could help, the added complexity
> of keeping the two in sync may make it debatable - we'll have to
> re-evaluate once NVDIMMs become more easily available :)
>

​It should just be a double update of the same four bytes to both places.
Update the RAM last but only ACK the write after both are written and
memory fenced.  If a read comes along and gets the RAM image out of sync
with the NV image, you are reading "during" a write so getting the previous
data is still "correct"​

​as you started the read before the write ACKd.
​

> >
> >
> > I have gotten into a lot of trouble optimizing for fio collisions when
> > these collisions don't really impact real-workload performance.  The
> > code has to be "correct" in the collision case, but it does not really
> > need to be fast.
> >
> >
> >
> > Doug Dumitru
> >
> > EasyCo LLC
> >
> >
> >
> >
> >
> > On Fri, Jun 19, 2015 at 9:50 AM, Verma, Vishal L
> > <vishal.l.verma@intel.com> wrote:
> >         On Fri, 2015-06-19 at 12:33 -0400, Mikulas Patocka wrote:
> >         > Hi
> >         >
> >         > I looked at the new the persistent memory block device
> >         driver
> >         > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it
> >         seems that the
> >         > interface between them is incorrect.
> >         >
> >         > If I want to use persistent memory in another driver, for a
> >         different
> >         > purpose, how can I make sure that that drivers/block/pmem.c
> >         doesn't attach
> >         > to this piece of memory and export it? It seems not
> >         possible.
> >         > drivers/block/pmem.c attaches to everything without regard
> >         that there may
> >         > be other users of persistent memory.
> >         >
> >         > I think a correct solution would be to add a partition table
> >         at the
> >         > beginning of persistent memory area and this partition table
> >         would
> >         > describe which parts belong to which programs - so that
> >         different programs
> >         > could use persistent memory and not step over each other's
> >         data. Is there
> >         > some effort to standardize the partition table ongoing?
> >         >
> >         >
> >         > BTW. some journaling filesystems assume that 512-byte sector
> >         is written
> >         > atomically. drivers/block/pmem.c breaks this requirement.
> >         Persistent
> >         > memory only gurantees 8-byte atomic writes.
> >
> >
> >         Hi Mikulas,
> >
> >         I can answer this part - The idea is that file systems that
> >         need sector
> >         atomicity will use the "Block Translation Table" (BTT) [1]. It
> >         would be
> >         a stacked block device on top of a pmem device (or partition),
> >         and file
> >         systems would be able to use it either for the entire space to
> >         get
> >         atomicity for all blocks, or if they want to use DAX, make two
> >         partitions, and enable the BTT only on one partition, and use
> >         it as the
> >         logdev.
> >
> >                 -Vishal
> >
> >         [1]: https://lkml.org/lkml/2015/6/17/950
> >
> >         >
> >         > Mikulas
> >         > _______________________________________________
> >         > Linux-nvdimm mailing list
> >         > Linux-nvdimm@lists.01.org
> >         > https://lists.01.org/mailman/listinfo/linux-nvdimm
> >
> >
> >         --
> >         dm-devel mailing list
> >         dm-devel@redhat.com
> >         https://www.redhat.com/mailman/listinfo/dm-devel
> >
> >
> >
> >
> > --
> > Doug Dumitru
> > EasyCo LLC
> >
> > _______________________________________________
> > Linux-nvdimm mailing list
> > Linux-nvdimm@lists.01.org
> > https://lists.01.org/mailman/listinfo/linux-nvdimm
>
>
​My comment on block sizes was only 512/4K, but the code should be easy to
​setup at anything binary.


-- 
Doug Dumitru
EasyCo LLC

[-- Attachment #1.2: Type: text/html, Size: 17115 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Persistent memory interface
  2015-06-22 21:35       ` Verma, Vishal L
@ 2015-06-23 10:21         ` hch
  0 siblings, 0 replies; 15+ messages in thread
From: hch @ 2015-06-23 10:21 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: Busch, Keith, akpm, linux-nvdimm, doug, axboe, dm-devel,
	mpatocka, hpa, bp, tglx, luto, hch

On Mon, Jun 22, 2015 at 09:35:47PM +0000, Verma, Vishal L wrote:
> Hm, what would be the use case? We do support a variety of metadata
> sizes for protection information [1]. Apart from this, the standard
> block sizes of 512 and 4096 are the only ones Linux can support, so I'm
> not sure what you mean by other binary block sizes?

Not that I think it's useful, but Linux generally supports any
power of two between 512 bytes and the system page size (largest
supported one for mainline architectures seems to be 64k).

There's also been all kinds of talk of supporting larger than page size
block sizes.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-06-23 10:21 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-19 16:33 Persistent memory interface Mikulas Patocka
2015-06-19 16:48 ` Matthew Wilcox
2015-06-19 16:52   ` Mikulas Patocka
2015-06-19 16:50 ` Verma, Vishal L
2015-06-22 16:50   ` Doug Dumitru
2015-06-22 19:01     ` Doug Dumitru
2015-06-22 21:35       ` Verma, Vishal L
2015-06-23 10:21         ` hch
2015-06-22 21:25     ` Verma, Vishal L
2015-06-22 22:17       ` Doug Dumitru
2015-06-19 16:53 ` Doug Dumitru
2015-06-19 17:35 ` Dan Williams
2015-06-19 19:14   ` Mikulas Patocka
2015-06-19 19:22     ` Dan Williams
2015-06-19 19:23       ` Dan Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.