* [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Jerome Glisse @ 2016-12-13 18:15 UTC
  To: lsf-pc, linux-mm, linux-block, linux-fsdevel

I would like to discuss un-addressable device memory in the context of
filesystems and block devices. Specifically, how to handle write-back, read,
... when a filesystem page is migrated to device memory that the CPU cannot
access.

I intend to post a patchset leveraging the same idea as the existing
block bounce helper (block/bounce.c) to handle this. I believe this is
worth discussing during the summit, to see how people feel about such a
plan and whether they have better ideas.


I would also like to join discussions on:
  - Peer-to-Peer DMAs between PCIe devices
  - CDM coherent device memory
  - PMEM
  - overall mm discussions

Cheers,
Jérôme


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: James Bottomley @ 2016-12-13 18:20 UTC
  To: Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> I would like to discuss un-addressable device memory in the context 
> of filesystem and block device. Specificaly how to handle write-back,
> read, ... when a filesystem page is migrated to device memory that 
> CPU can not access.
> 
> I intend to post a patchset leveraging the same idea as the existing
> block bounce helper (block/bounce.c) to handle this. I believe this 
> is worth discussing during summit see how people feels about such 
> plan and if they have better ideas.

Isn't this pretty much what the transcendent memory interfaces we
currently have are for?  Its current use cases seem to be compressed
swap and distributed memory, but there doesn't seem to be any reason in
principle why you couldn't use the interface for this as well.

James


> I also like to join discussions on:
>   - Peer-to-Peer DMAs between PCIe devices
>   - CDM coherent device memory
>   - PMEM
>   - overall mm discussions
> 
> Cheers,
> Jérôme



* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Jerome Glisse @ 2016-12-13 18:55 UTC
  To: James Bottomley; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 10:20:52AM -0800, James Bottomley wrote:
> On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> > I would like to discuss un-addressable device memory in the context 
> > of filesystem and block device. Specificaly how to handle write-back,
> > read, ... when a filesystem page is migrated to device memory that 
> > CPU can not access.
> > 
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this 
> > is worth discussing during summit see how people feels about such 
> > plan and if they have better ideas.
> 
> Isn't this pretty much what the transcendent memory interfaces we
> currently have are for?  It's current use cases seem to be compressed
> swap and distributed memory, but there doesn't seem to be any reason in
> principle why you can't use the interface as well.
> 

I am not a specialist of tmem or cleancache, but my understanding is that
there is no way to allow a file-backed page to be dirtied while it sits
in this special memory.

In my case, when you migrate a page to the device it might very well be
so that the device can write something into it (results of some sort of
computation). So a page might migrate to device memory as clean but
return from it in a dirty state.

The second aspect is that even though the memory I am dealing with is
un-addressable, I still have a struct page for it and I want to be able
to use regular page migration.

So given my requirements I didn't think that cleancache was the way
to address them. Maybe I am wrong.

Cheers,
Jérôme


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: James Bottomley @ 2016-12-13 20:01 UTC
  To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, 2016-12-13 at 13:55 -0500, Jerome Glisse wrote:
> On Tue, Dec 13, 2016 at 10:20:52AM -0800, James Bottomley wrote:
> > On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> > > I would like to discuss un-addressable device memory in the
> > > context 
> > > of filesystem and block device. Specificaly how to handle write
> > > -back,
> > > read, ... when a filesystem page is migrated to device memory
> > > that 
> > > CPU can not access.
> > > 
> > > I intend to post a patchset leveraging the same idea as the
> > > existing
> > > block bounce helper (block/bounce.c) to handle this. I believe
> > > this 
> > > is worth discussing during summit see how people feels about such
> > > plan and if they have better ideas.
> > 
> > Isn't this pretty much what the transcendent memory interfaces we
> > currently have are for?  It's current use cases seem to be
> > compressed
> > swap and distributed memory, but there doesn't seem to be any
> > reason in
> > principle why you can't use the interface as well.
> > 
> 
> I am not a specialist of tmem or cleancache

Well, that makes two of us; I just got to sit through Dan Magenheimer's
talks and some stuff stuck.

>  but my understand is that there is no way to allow for file back 
> page to be dirtied while being in this special memory.

Unless you have some other definition of dirtied, I believe that's what
an exclusive tmem get in frontswap actually does.  It marks the page
dirty when it comes back because it may have been modified.

> In my case when you migrate a page to the device it might very well 
> be so that the device can write something in it (results of some sort 
> of computation). So page might migrate to device memory as clean but
> return from it in dirty state.
> 
> Second aspect is that even if memory i am dealing with is un
> -addressable i still have struct page for it and i want to be able to 
> use regular page migration.

Tmem keeps a struct page ... what's the problem with page migration?
The fact that tmem locks the page when it's not addressable and you
want to be able to migrate the page even when it's not addressable?

> So given my requirement i didn't thought that cleancache was the way
> to address them. Maybe i am wrong.

I'm not saying it is, I just asked if you'd considered it, since the
requirements look similar.

James




* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Dave Chinner @ 2016-12-13 20:15 UTC
  To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> I would like to discuss un-addressable device memory in the context of
> filesystem and block device. Specificaly how to handle write-back, read,
> ... when a filesystem page is migrated to device memory that CPU can not
> access.

You mean pmem that is DAX-capable that suddenly, without warning,
becomes non-DAX capable?

If you are not talking about pmem and DAX, then exactly what does
"when a filesystem page is migrated to device memory that CPU can
not access" mean? What "filesystem page" are we talking about that
can get migrated from main RAM to something the CPU can't access?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Jerome Glisse @ 2016-12-13 20:22 UTC
  To: James Bottomley; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 12:01:04PM -0800, James Bottomley wrote:
> On Tue, 2016-12-13 at 13:55 -0500, Jerome Glisse wrote:
> > On Tue, Dec 13, 2016 at 10:20:52AM -0800, James Bottomley wrote:
> > > On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> > > > I would like to discuss un-addressable device memory in the
> > > > context 
> > > > of filesystem and block device. Specificaly how to handle write
> > > > -back,
> > > > read, ... when a filesystem page is migrated to device memory
> > > > that 
> > > > CPU can not access.
> > > > 
> > > > I intend to post a patchset leveraging the same idea as the
> > > > existing
> > > > block bounce helper (block/bounce.c) to handle this. I believe
> > > > this 
> > > > is worth discussing during summit see how people feels about such
> > > > plan and if they have better ideas.
> > > 
> > > Isn't this pretty much what the transcendent memory interfaces we
> > > currently have are for?  It's current use cases seem to be
> > > compressed
> > > swap and distributed memory, but there doesn't seem to be any
> > > reason in
> > > principle why you can't use the interface as well.
> > > 
> > 
> > I am not a specialist of tmem or cleancache
> 
> Well, that makes two of us; I just got to sit through Dan Magenheimer's
> talks and some stuff stuck.
> 
> >  but my understand is that there is no way to allow for file back 
> > page to be dirtied while being in this special memory.
> 
> Unless you have some other definition of dirtied, I believe that's what
> an exclusive tmem get in frontswap actually does.  It marks the page
> dirty when it comes back because it may have been modified.

Well, frontswap only supports anonymous or shared pages, not random filemap
pages. So it doesn't help with what I am aiming at :) Note that in my case
the device reports accurate dirty information (did the device modify the
page or not), assuming hardware bugs don't exist.


> > In my case when you migrate a page to the device it might very well 
> > be so that the device can write something in it (results of some sort 
> > of computation). So page might migrate to device memory as clean but
> > return from it in dirty state.
> > 
> > Second aspect is that even if memory i am dealing with is un
> > -addressable i still have struct page for it and i want to be able to 
> > use regular page migration.
> 
> Tmem keeps a struct page ... what's the problem with page migration?
> the fact that tmem locks the page when it's not addressable and you
> want to be able to migrate the page even when it's not addressable?

Well, the way cleancache or frontswap works is that they are used when the
kernel is trying to make room or evict something. In my case it is the
device that triggers the migration, for a range of virtual addresses of a
process. Sure, I can make a weird helper that would force the pages I want
to migrate into frontswap or cleancache, but that seems counter-intuitive
to me.

One extra requirement for me is to be able to easily and quickly find
the migrated page by looking at the CPU page table of the process.
Frontswap adds a level of indirection where I would need to go through
frontswap to find the memory. With cleancache there isn't even any
information left (the page table entry is cleared).
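
To make that concrete, here is a rough sketch (not actual HMM code) of the
kind of page table lookup I mean; is_device_entry() and device_entry_to_page()
are hypothetical names for whatever helper ends up encoding/decoding the
special entry, the rest are existing kernel APIs:

/*
 * Sketch only: the entry left in the CPU page table for a migrated page
 * is a special, non-present swap-style entry that still identifies the
 * device's struct page.
 */
#include <linux/mm.h>
#include <linux/swapops.h>

static struct page *find_migrated_page(pte_t pte)
{
	swp_entry_t entry;

	if (pte_none(pte) || pte_present(pte))
		return NULL;			/* nothing migrated here */

	entry = pte_to_swp_entry(pte);
	if (!is_device_entry(entry))		/* hypothetical test */
		return NULL;

	return device_entry_to_page(entry);	/* hypothetical lookup */
}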


> 
> > So given my requirement i didn't thought that cleancache was the way
> > to address them. Maybe i am wrong.
> 
> I'm not saying it is, I just asked if you'd considered it, since the
> requirements look similar.

Yes, I briefly considered it, but from the high-level overview I had it did
not seem to address all my requirements. Maybe that is because I lack
in-depth knowledge of cleancache/frontswap, but skimming through the code
didn't convince me that I needed to dig deeper.

The solution I am pursuing uses struct page, and thus to the kernel
everything is as if it were a regular page. The only things that don't work
are kmap or mapping it into a process, but this can easily be handled.
For filesystems, the issues are about anything that does I/O, so read/write/
writeback.

In many cases, if CPU I/O happens, what I want to do is migrate back to a
regular page, so the read/write case is easy. But for writeback, if the page
is dirty on the device and the device reports it (by calling set_page_dirty())
then I still want writeback to work so I don't lose data (if the device
dirtied the page it is probably because it was instructed to save current
computations).

With this in mind, the bounce helper designed to work around block device
limitations with respect to the pages they can access seemed to be a perfect
fit. All I care about is providing a bounce page that allows writeback to
happen without having to go through the "slow" page migration back to a
system page.
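
Something like the following is the rough shape I have in mind, in the
spirit of block/bounce.c's segment loop (just a sketch, not the actual
patches; is_unaddressable_device_page() and device_dma_from_device_page()
are hypothetical, the rest are existing 4.x kernel APIs, and completion
handling / freeing of the bounce pages is left out):

/*
 * Sketch: before a write-back bio is submitted, replace every
 * un-addressable device page in it with a freshly allocated system page
 * and have the device DMA the current page contents into that bounce page.
 */
#include <linux/bio.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static int bounce_unaddressable_bio(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	bio_for_each_segment_all(bvec, bio, i) {
		struct page *dev_page = bvec->bv_page;
		struct page *bounce;

		if (!is_unaddressable_device_page(dev_page))	/* hypothetical */
			continue;

		bounce = alloc_page(GFP_NOIO);
		if (!bounce)
			return -ENOMEM;

		/* hypothetical: device copies dev_page's data into bounce */
		device_dma_from_device_page(dev_page, bounce);

		/* writeback now reads from a CPU/DMA-addressable page */
		bvec->bv_page = bounce;
	}
	return 0;
}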

Jérôme


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Dave Hansen @ 2016-12-13 20:27 UTC
  To: James Bottomley, Jerome Glisse
  Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On 12/13/2016 12:01 PM, James Bottomley wrote:
>> > Second aspect is that even if memory i am dealing with is un
>> > -addressable i still have struct page for it and i want to be able to 
>> > use regular page migration.
> Tmem keeps a struct page ... what's the problem with page migration?
> the fact that tmem locks the page when it's not addressable and you
> want to be able to migrate the page even when it's not addressable?

Hi James,

Why do you say that tmem keeps a 'struct page'?  For instance, its
->put_page operation _takes_ a 'struct page', but that's in the
delete_from_page_cache() path where the page's last reference has been
dropped and it is about to go away.  The role of 'struct page' here is
just to help create a key so that tmem can find the contents later
*without* the original 'struct page'.

Jerome's pages here are a new class of half-crippled 'struct page' which
support more VM features than ZONE_DEVICE pages, but not quite a full
feature set.  It supports (and needs to support) a heck of a lot more VM
features than memory in tmem would, though.




* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Jerome Glisse @ 2016-12-13 20:31 UTC
  To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > I would like to discuss un-addressable device memory in the context of
> > filesystem and block device. Specificaly how to handle write-back, read,
> > ... when a filesystem page is migrated to device memory that CPU can not
> > access.
> 
> You mean pmem that is DAX-capable that suddenly, without warning,
> becomes non-DAX capable?
> 
> If you are not talking about pmem and DAX, then exactly what does
> "when a filesystem page is migrated to device memory that CPU can
> not access" mean? What "filesystem page" are we talking about that
> can get migrated from main RAM to something the CPU can't access?

I am talking about GPU, FPGA, ... any PCIe device that has fast on-board
memory that cannot be exposed transparently to the CPU. I am reusing
ZONE_DEVICE for this; you can see the HMM patchset on linux-mm:
https://lwn.net/Articles/706856/

So in my case I am only considering non-DAX/PMEM filesystems, i.e. any
"regular" filesystem backed by a "regular" block device. I want to be
able to migrate mmapped areas of such a filesystem to device memory while
the device is actively using that memory.

From the kernel's point of view such memory is almost like any other: it
has a struct page, and most of the mm code is none the wiser, nor needs
to be. CPU access triggers a migration back to a regular CPU-accessible
page.
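
To give an idea of that fault path only (a sketch, not actual HMM code:
device_entry_to_page(), device_dma_from_device_page() and remap_and_unlock()
are hypothetical, the rest are existing kernel APIs):

/*
 * Sketch: a CPU fault on the special entry allocates a regular page,
 * has the device DMA the data back, then replaces the special entry
 * with a normal pte.
 */
#include <linux/mm.h>
#include <linux/swapops.h>

static int device_page_fault(struct vm_area_struct *vma, unsigned long addr,
			     pte_t orig_pte)
{
	struct page *dev_page, *new_page;

	dev_page = device_entry_to_page(pte_to_swp_entry(orig_pte)); /* hypothetical */

	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
	if (!new_page)
		return VM_FAULT_OOM;

	/* hypothetical: device copies its copy of the data back */
	device_dma_from_device_page(dev_page, new_page);

	/* hypothetical: install new_page, update rmap, free the device page */
	return remap_and_unlock(vma, addr, new_page);
}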

But for things like writeback I want to be able to write back without
having to migrate the page back first, so that the data can stay on the
device while writeback is happening.

Jérôme


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Dave Chinner @ 2016-12-13 21:10 UTC
  To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > I would like to discuss un-addressable device memory in the context of
> > > filesystem and block device. Specificaly how to handle write-back, read,
> > > ... when a filesystem page is migrated to device memory that CPU can not
> > > access.
> > 
> > You mean pmem that is DAX-capable that suddenly, without warning,
> > becomes non-DAX capable?
> > 
> > If you are not talking about pmem and DAX, then exactly what does
> > "when a filesystem page is migrated to device memory that CPU can
> > not access" mean? What "filesystem page" are we talking about that
> > can get migrated from main RAM to something the CPU can't access?
> 
> I am talking about GPU, FPGA, ... any PCIE device that have fast on
> board memory that can not be expose transparently to the CPU. I am
> reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> https://lwn.net/Articles/706856/

So ZONE_DEVICE memory that is a DMA target but not CPU addressable?

> So in my case i am only considering non DAX/PMEM filesystem ie any
> "regular" filesystem back by a "regular" block device. I want to be
> able to migrate mmaped area of such filesystem to device memory while
> the device is actively using that memory.

"migrate mmapped area of such filesystem" means what, exactly?

Are you talking about file data contents that have been copied into
the page cache and mmapped into a user process address space?
IOWs, migrating ZONE_NORMAL page cache page content and state
to a new ZONE_DEVICE page, and then migrating back again somehow?

> From kernel point of view such memory is almost like any other, it
> has a struct page and most of the mm code is non the wiser, nor need
> to be about it. CPU access trigger a migration back to regular CPU
> accessible page.

That sounds ... complex. Page migration on page cache access inside
the filesystem IO path locking during read()/write() sounds like
a great way to cause deadlocks....

> But for thing like writeback i want to be able to do writeback with-
> out having to migrate page back first. So that data can stay on the
> device while writeback is happening.

Why can't you do writeback before migration, so only clean pages get
moved?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
From: Jerome Glisse @ 2016-12-13 21:24 UTC
  To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > I would like to discuss un-addressable device memory in the context of
> > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > access.
> > > 
> > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > becomes non-DAX capable?
> > > 
> > > If you are not talking about pmem and DAX, then exactly what does
> > > "when a filesystem page is migrated to device memory that CPU can
> > > not access" mean? What "filesystem page" are we talking about that
> > > can get migrated from main RAM to something the CPU can't access?
> > 
> > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > board memory that can not be expose transparently to the CPU. I am
> > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > https://lwn.net/Articles/706856/
> 
> So ZONE_DEVICE memory that is a DMA target but not CPU addressable?

Well, not only a target, it can be a source too. The device can read
and write any system memory and DMA to/from that memory to its on-board
memory.

> 
> > So in my case i am only considering non DAX/PMEM filesystem ie any
> > "regular" filesystem back by a "regular" block device. I want to be
> > able to migrate mmaped area of such filesystem to device memory while
> > the device is actively using that memory.
> 
> "migrate mmapped area of such filesystem" means what, exactly?

fd = open("/path/to/some/file")
ptr = mmap(fd, ...);
gpu_compute_something(ptr);
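
Spelled out a little more (error handling omitted; the path and
gpu_compute_something() are placeholders for whatever the device runtime
provides), the application side is nothing more than ordinary open()/mmap():

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

extern void gpu_compute_something(void *buf, size_t len); /* placeholder */

int main(void)
{
	struct stat st;
	int fd = open("/path/to/some/file", O_RDWR);
	void *ptr;

	fstat(fd, &st);
	ptr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* the device runtime works directly on the file-backed mapping */
	gpu_compute_something(ptr, st.st_size);

	munmap(ptr, st.st_size);
	close(fd);
	return 0;
}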

> 
> Are you talking about file data contents that have been copied into
> the page cache and mmapped into a user process address space?
> IOWs, migrating ZONE_NORMAL page cache page content and state
> to a new ZONE_DEVICE page, and then migrating back again somehow?

Take any existing application that mmaps a file and allow chunks of that
mmapped file to be migrated to device memory without the application
even knowing about it. So there is nothing special with respect to that
mmapped file. It is a regular file on your filesystem.


> > From kernel point of view such memory is almost like any other, it
> > has a struct page and most of the mm code is non the wiser, nor need
> > to be about it. CPU access trigger a migration back to regular CPU
> > accessible page.
> 
> That sounds ... complex. Page migration on page cache access inside
> the filesytem IO path locking during read()/write() sounds like
> a great way to cause deadlocks....

There are a few restrictions on device pages: no one can do GUP on them
and thus no one can pin them, hence they can always be migrated back. Yes,
each fs needs modification, but most of it (if not all) is isolated in
common filemap helpers.
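
A rough sketch of what such a common filemap helper could look like
(is_device_unaddressable_page() and migrate_device_page_back() are
placeholder names for illustration only, not existing kernel APIs):

static struct page *
filemap_get_addressable_page(struct address_space *mapping, pgoff_t index)
{
        struct page *page = find_get_page(mapping, index);

        /*
         * If the page cache entry lives in un-addressable device memory,
         * wait for it to be migrated back before the CPU touches it.
         */
        if (page && is_device_unaddressable_page(page)) {
                if (migrate_device_page_back(page)) {
                        put_page(page);
                        return NULL;    /* caller retries the lookup */
                }
        }
        return page;
}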


> > But for thing like writeback i want to be able to do writeback with-
> > out having to migrate page back first. So that data can stay on the
> > device while writeback is happening.
> 
> Why can't you do writeback before migration, so only clean pages get
> moved?

Because the device can write to the page while the page is in device
memory, and we might want to write back to disk while the page stays in
device memory and the computation continues.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-13 21:24         ` Jerome Glisse
@ 2016-12-13 22:08           ` Dave Hansen
  -1 siblings, 0 replies; 75+ messages in thread
From: Dave Hansen @ 2016-12-13 22:08 UTC (permalink / raw)
  To: Jerome Glisse, Dave Chinner
  Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel, Williams, Dan J

On 12/13/2016 01:24 PM, Jerome Glisse wrote:
> 
>>> > > From kernel point of view such memory is almost like any other, it
>>> > > has a struct page and most of the mm code is non the wiser, nor need
>>> > > to be about it. CPU access trigger a migration back to regular CPU
>>> > > accessible page.
>> > 
>> > That sounds ... complex. Page migration on page cache access inside
>> > the filesytem IO path locking during read()/write() sounds like
>> > a great way to cause deadlocks....
> There are few restriction on device page, no one can do GUP on them and
> thus no one can pin them. Hence they can always be migrated back. Yes
> each fs need modification, most of it (if not all) is isolated in common
> filemap helpers.

Huh, that's pretty different from the other ZONE_DEVICE uses.  For
those, you *can* do get_user_pages().

I'd be really interested to see the feature set that these pages have
and how it differs from regular memory and the ZONE_DEVICE memory that
we have in-kernel today.

BTW, how is this restriction implemented?  I would have expected to see
follow_page_pte() or vm_normal_page() getting modified.  I don't see a
single reference to get_user_pages or "GUP" in any of the latest HMM
patch set or the changelogs.

As best I can tell, the slow GUP path will get stuck in a loop inside
follow_page_pte(), while the fast GUP path will allow you to acquire a
reference to the page.  But, maybe I'm reading the code wrong.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-13 21:24         ` Jerome Glisse
@ 2016-12-13 22:13           ` Dave Chinner
  -1 siblings, 0 replies; 75+ messages in thread
From: Dave Chinner @ 2016-12-13 22:13 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > I would like to discuss un-addressable device memory in the context of
> > > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > > access.
> > > > 
> > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > becomes non-DAX capable?
> > > > 
> > > > If you are not talking about pmem and DAX, then exactly what does
> > > > "when a filesystem page is migrated to device memory that CPU can
> > > > not access" mean? What "filesystem page" are we talking about that
> > > > can get migrated from main RAM to something the CPU can't access?
> > > 
> > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > board memory that can not be expose transparently to the CPU. I am
> > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > https://lwn.net/Articles/706856/
> > 
> > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> 
> Well not only target, it can be source too. But the device can read
> and write any system memory and dma to/from that memory to its on
> board memory.

So you want the device to be able to dirty mmapped pages that the
CPU can't access?

> > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > "regular" filesystem back by a "regular" block device. I want to be
> > > able to migrate mmaped area of such filesystem to device memory while
> > > the device is actively using that memory.
> > 
> > "migrate mmapped area of such filesystem" means what, exactly?
> 
> fd = open("/path/to/some/file")
> ptr = mmap(fd, ...);
> gpu_compute_something(ptr);

Thought so. Lots of problems with this.

> > Are you talking about file data contents that have been copied into
> > the page cache and mmapped into a user process address space?
> > IOWs, migrating ZONE_NORMAL page cache page content and state
> > to a new ZONE_DEVICE page, and then migrating back again somehow?
> 
> Take any existing application that mmap a file and allow to migrate
> chunk of that mmaped file to device memory without the application
> even knowing about it. So nothing special in respect to that mmaped
> file.

From the application point of view. For the filesystem, page cache, etc.,
there are substantial problems here...

> It is a regular file on your filesystem.

... because of this.

> > > From kernel point of view such memory is almost like any other, it
> > > has a struct page and most of the mm code is non the wiser, nor need
> > > to be about it. CPU access trigger a migration back to regular CPU
> > > accessible page.
> > 
> > That sounds ... complex. Page migration on page cache access inside
> > the filesytem IO path locking during read()/write() sounds like
> > a great way to cause deadlocks....
> 
> There are few restriction on device page, no one can do GUP on them and
> thus no one can pin them. Hence they can always be migrated back. Yes
> each fs need modification, most of it (if not all) is isolated in common
> filemap helpers.

Sure, but you haven't answered my question: how do you propose we
address the issue of placing all the mm locks required for migration
under the filesystem IO path locks?

> > > But for thing like writeback i want to be able to do writeback with-
> > > out having to migrate page back first. So that data can stay on the
> > > device while writeback is happening.
> > 
> > Why can't you do writeback before migration, so only clean pages get
> > moved?
> 
> Because device can write to the page while the page is inside the device
> memory and we might want to writeback to disk while page stays in device
> memory and computation continues.

Ok. So how does the device trigger ->page_mkwrite on a clean page to
tell the filesystem that the page has been dirtied? So that, for
example, if the page covers a hole because the file is sparse the
filesystem can do the required block allocation and data
initialisation (i.e. zero the cached page) before it gets marked
dirty and any data gets written to it?

And if zeroing the page during such a fault requires CPU access to
the data, how do you propose we handle page migration in the middle
of the page fault to allow the CPU to zero the page? Seems like more
lock order/inversion problems there, too...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-13 22:13           ` Dave Chinner
  (?)
@ 2016-12-13 22:55             ` Jerome Glisse
  -1 siblings, 0 replies; 75+ messages in thread
From: Jerome Glisse @ 2016-12-13 22:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > > I would like to discuss un-addressable device memory in the context of
> > > > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > > > access.
> > > > > 
> > > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > > becomes non-DAX capable?
> > > > > 
> > > > > If you are not talking about pmem and DAX, then exactly what does
> > > > > "when a filesystem page is migrated to device memory that CPU can
> > > > > not access" mean? What "filesystem page" are we talking about that
> > > > > can get migrated from main RAM to something the CPU can't access?
> > > > 
> > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > > board memory that can not be expose transparently to the CPU. I am
> > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > > https://lwn.net/Articles/706856/
> > > 
> > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> > 
> > Well not only target, it can be source too. But the device can read
> > and write any system memory and dma to/from that memory to its on
> > board memory.
> 
> So you want the device to be able to dirty mmapped pages that the
> CPU can't access?

Yes, correct.


> > > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > > "regular" filesystem back by a "regular" block device. I want to be
> > > > able to migrate mmaped area of such filesystem to device memory while
> > > > the device is actively using that memory.
> > > 
> > > "migrate mmapped area of such filesystem" means what, exactly?
> > 
> > fd = open("/path/to/some/file")
> > ptr = mmap(fd, ...);
> > gpu_compute_something(ptr);
> 
> Thought so. Lots of problems with this.
> 
> > > Are you talking about file data contents that have been copied into
> > > the page cache and mmapped into a user process address space?
> > > IOWs, migrating ZONE_NORMAL page cache page content and state
> > > to a new ZONE_DEVICE page, and then migrating back again somehow?
> > 
> > Take any existing application that mmap a file and allow to migrate
> > chunk of that mmaped file to device memory without the application
> > even knowing about it. So nothing special in respect to that mmaped
> > file.
> 
> From the application point of view. Filesystem, page cache, etc
> there's substantial problems here...
> 
> > It is a regular file on your filesystem.
> 
> ... because of this.
> 
> > > > From kernel point of view such memory is almost like any other, it
> > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > accessible page.
> > > 
> > > That sounds ... complex. Page migration on page cache access inside
> > > the filesytem IO path locking during read()/write() sounds like
> > > a great way to cause deadlocks....
> > 
> > There are few restriction on device page, no one can do GUP on them and
> > thus no one can pin them. Hence they can always be migrated back. Yes
> > each fs need modification, most of it (if not all) is isolated in common
> > filemap helpers.
> 
> Sure, but you haven't answered my question: how do you propose we
> address the issue of placing all the mm locks required for migration
> under the filesystem IO path locks?

Two different plans (which are not exclusive of each other). The first is
to use a workqueue and have read/write wait on the workqueue to finish
migrating the page back.
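
A bare-bones sketch of that first plan (struct hmm_migrate_work,
hmm_migrate_work_fn() and wait_for_page_migration() are made-up names,
and the migration itself is elided):

struct hmm_migrate_work {
        struct work_struct work;
        struct page *page;
        struct completion done;
};

static void hmm_migrate_work_fn(struct work_struct *work)
{
        struct hmm_migrate_work *mw =
                container_of(work, struct hmm_migrate_work, work);

        /* migrate mw->page back to system RAM, outside the fs IO path */
        complete(&mw->done);
}

static void wait_for_page_migration(struct page *page)
{
        struct hmm_migrate_work mw = { .page = page };

        INIT_WORK_ONSTACK(&mw.work, hmm_migrate_work_fn);
        init_completion(&mw.done);
        schedule_work(&mw.work);
        wait_for_completion(&mw.done);  /* read()/write() sleeps here */
        destroy_work_on_stack(&mw.work);
}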

The second solution is to use a bounce page during I/O so that there is
no need for migration.
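
And a sketch of the bounce idea, loosely modelled on the existing block
bounce helper (block/bounce.c); dma_copy_from_device_page() stands in for
whatever hook the device driver would provide, it is not an existing API:

static struct page *bounce_device_page_for_io(struct page *device_page)
{
        struct page *bounce = alloc_page(GFP_NOIO);

        if (!bounce)
                return NULL;

        /* DMA the current contents out of device memory. */
        if (dma_copy_from_device_page(bounce, device_page)) {
                __free_page(bounce);
                return NULL;
        }

        /*
         * The bio is then built against the bounce page, so the
         * un-addressable page can stay on the device during writeback.
         */
        return bounce;
}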


> > > > But for thing like writeback i want to be able to do writeback with-
> > > > out having to migrate page back first. So that data can stay on the
> > > > device while writeback is happening.
> > > 
> > > Why can't you do writeback before migration, so only clean pages get
> > > moved?
> > 
> > Because device can write to the page while the page is inside the device
> > memory and we might want to writeback to disk while page stays in device
> > memory and computation continues.
> 
> Ok. So how does the device trigger ->page_mkwrite on a clean page to
> tell the filesystem that the page has been dirtied? So that, for
> example, if the page covers a hole because the file is sparse the
> filesytem can do the required block allocation and data
> initialisation (i.e. zero the cached page) before it gets marked
> dirty and any data gets written to it?
> 
> And if zeroing the page during such a fault requires CPU access to
> the data, how do you propose we handle page migration in the middle
> of the page fault to allow the CPU to zero the page? Seems like more
> lock order/inversion problems there, too...


File-backed pages are never allocated on the device, at least we have no
incentive to do so for the use cases we care about today. So a regular page
is first used and initialized (to zero for a hole) before being migrated to
the device. So I do not believe there should be any major concern about
->page_mkwrite. At least this was my impression when I looked at the generic
filemap one, but for some filesystems this might be problematic. I intend to
enable this kind of migration on a per-fs basis and to allow userspace to
block such migration for a given fs.

Jérôme

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-13 22:08           ` Dave Hansen
  (?)
@ 2016-12-13 23:02             ` Jerome Glisse
  -1 siblings, 0 replies; 75+ messages in thread
From: Jerome Glisse @ 2016-12-13 23:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Chinner, lsf-pc, linux-mm, linux-block, linux-fsdevel,
	Williams, Dan J

On Tue, Dec 13, 2016 at 02:08:22PM -0800, Dave Hansen wrote:
> On 12/13/2016 01:24 PM, Jerome Glisse wrote:
> > 
> >>> > > From kernel point of view such memory is almost like any other, it
> >>> > > has a struct page and most of the mm code is non the wiser, nor need
> >>> > > to be about it. CPU access trigger a migration back to regular CPU
> >>> > > accessible page.
> >> > 
> >> > That sounds ... complex. Page migration on page cache access inside
> >> > the filesytem IO path locking during read()/write() sounds like
> >> > a great way to cause deadlocks....
> > There are few restriction on device page, no one can do GUP on them and
> > thus no one can pin them. Hence they can always be migrated back. Yes
> > each fs need modification, most of it (if not all) is isolated in common
> > filemap helpers.
> 
> Huh, that's pretty different from the other ZONE_DEVICE uses.  For
> those, you *can* do get_user_pages().
> 
> I'd be really interested to see the feature set that these pages have
> and how it differs from regular memory and the ZONE_DEVICE memory that
> have have in-kernel today.

Well, I can do a list for the current patchset, where I do not allow
migration of file-backed pages. Roughly, you can not kmap or GUP them. But
GUP has many more implications, like direct I/O (source or destination of
direct I/O) ...

> 
> BTW, how is this restriction implemented?  I would have expected to see
> follow_page_pte() or vm_normal_page() getting modified.  I don't see a
> single reference to get_user_pages or "GUP" in any of the latest HMM
> patch set or the changelogs.
> 
> As best I can tell, the slow GUP path will get stuck in a loop inside
> follow_page_pte(), while the fast GUP path will allow you to acquire a
> reference to the page.  But, maybe I'm reading the code wrong.

It is a side effect of having a special swap pte: follow_page_pte()
returns NULL, which triggers a page fault through handle_mm_fault(), which
triggers migration back to a regular page. The same goes for the fast GUP
version. There is never a valid pte for an un-addressable page.
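
An illustrative sketch of that flow; is_device_entry() and
migrate_device_entry_back() are placeholder names, not existing kernel
functions:

static int handle_device_swap_entry(struct vm_area_struct *vma,
                                    unsigned long address, swp_entry_t entry)
{
        if (!is_device_entry(entry))
                return VM_FAULT_SIGBUS;

        /*
         * The pte is never valid while the data lives in un-addressable
         * device memory, so follow_page_pte()/GUP never see a page; CPU
         * access faults and ends up here, waiting for migration back to
         * a regular system RAM page.
         */
        return migrate_device_entry_back(vma, address, entry);
}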

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-13 22:55             ` Jerome Glisse
@ 2016-12-14  0:14               ` Dave Chinner
  -1 siblings, 0 replies; 75+ messages in thread
From: Dave Chinner @ 2016-12-14  0:14 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > From kernel point of view such memory is almost like any other, it
> > > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > > accessible page.
> > > > 
> > > > That sounds ... complex. Page migration on page cache access inside
> > > > the filesytem IO path locking during read()/write() sounds like
> > > > a great way to cause deadlocks....
> > > 
> > > There are few restriction on device page, no one can do GUP on them and
> > > thus no one can pin them. Hence they can always be migrated back. Yes
> > > each fs need modification, most of it (if not all) is isolated in common
> > > filemap helpers.
> > 
> > Sure, but you haven't answered my question: how do you propose we
> > address the issue of placing all the mm locks required for migration
> > under the filesystem IO path locks?
> 
> Two different plans (which are non exclusive of each other). First is to use
> workqueue and have read/write wait on the workqueue to be done migrating the
> page back.

Pushing something to a workqueue and then waiting on the workqueue
to complete the work doesn't change lock ordering problems - it
just hides them away and makes them harder to debug.

> Second solution is to use a bounce page during I/O so that there is no need
> for migration.

Which means the page in the device is left with out-of-date
contents, right?

If so, how do you prevent data corruption/loss when the device
has modified the page out of sight of the CPU and the bounce page
doesn't contain those modifications? Or if the dirty device page is
written back directly without containing the changes made in the
bounce page?

Hmmm - what happens when we invalidate and release a range of
file pages that have been migrated to a device? e.g. on truncate?

> > > > > But for thing like writeback i want to be able to do writeback with-
> > > > > out having to migrate page back first. So that data can stay on the
> > > > > device while writeback is happening.
> > > > 
> > > > Why can't you do writeback before migration, so only clean pages get
> > > > moved?
> > > 
> > > Because device can write to the page while the page is inside the device
> > > memory and we might want to writeback to disk while page stays in device
> > > memory and computation continues.
> > 
> > Ok. So how does the device trigger ->page_mkwrite on a clean page to
> > tell the filesystem that the page has been dirtied? So that, for
> > example, if the page covers a hole because the file is sparse the
> > filesytem can do the required block allocation and data
> > initialisation (i.e. zero the cached page) before it gets marked
> > dirty and any data gets written to it?
> > 
> > And if zeroing the page during such a fault requires CPU access to
> > the data, how do you propose we handle page migration in the middle
> > of the page fault to allow the CPU to zero the page? Seems like more
> > lock order/inversion problems there, too...
> 
> File back page are never allocated on device, at least we have no incentive
> for usecase we care about today to do so. So a regular page is first use
> and initialize (to zero for hole) before being migrated to device.
> So i do not believe there should be any major concern on ->page_mkwrite.

Such deja vu - inodes are not static objects as modern filesystems
are highly dynamic. If you want to have safe, reliable non-coherent
mmap-based file data offload to devices, then I suspect that we're
going to need pretty much all of the same restrictions the pmem
programming model requires for userspace data flushing. i.e.:

https://lkml.org/lkml/2016/9/15/33

At which point I have to ask: why is mmap considered to be the right
model for transferring data in and out of devices that are not
directly CPU addressable?

> At least
> this was my impression when i look at generic filemap one, but for some
> filesystem this might need be problematic.

Definitely problematic for XFS, btrfs, f2fs, ocfs2, and probably
ext4 and others as well.

> and allowing control by userspace to block such
> migration for given fs.

How do you propose doing that?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-14  0:14               ` Dave Chinner
  (?)
@ 2016-12-14  1:07                 ` Jerome Glisse
  -1 siblings, 0 replies; 75+ messages in thread
From: Jerome Glisse @ 2016-12-14  1:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > > From kernel point of view such memory is almost like any other, it
> > > > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > > > accessible page.
> > > > > 
> > > > > That sounds ... complex. Page migration on page cache access inside
> > > > > the filesytem IO path locking during read()/write() sounds like
> > > > > a great way to cause deadlocks....
> > > > 
> > > > There are few restriction on device page, no one can do GUP on them and
> > > > thus no one can pin them. Hence they can always be migrated back. Yes
> > > > each fs need modification, most of it (if not all) is isolated in common
> > > > filemap helpers.
> > > 
> > > Sure, but you haven't answered my question: how do you propose we
> > > address the issue of placing all the mm locks required for migration
> > > under the filesystem IO path locks?
> > 
> > Two different plans (which are non exclusive of each other). First is to use
> > workqueue and have read/write wait on the workqueue to be done migrating the
> > page back.
> 
> Pushing something to a workqueue and then waiting on the workqueue
> to complete the work doesn't change lock ordering problems - it
> just hides them away and makes them harder to debug.

Migration doesn't need many locks; below is a list, and I don't see any lock issue
with respect to ->read or ->write.

 lock_page(page);
 spin_lock_irq(&mapping->tree_lock);
 lock_buffer(bh); // if page has buffer_head
 i_mmap_lock_read(mapping);
 vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
    // page table lock for each entry
 }

I don't think I am missing any, and thus I don't see any real issues here. Care to
point to the lock you think is going to be problematic?
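
To make the first plan concrete, here is a minimal sketch (the hmm_* helpers
are invented names for illustration, not from any posted patch) of how a
read/write path could wait for the worker to migrate the page back and then
redo the lookup, since migration installs a new struct page in the mapping:

  static struct page *filemap_get_cpu_page(struct address_space *mapping,
                                           pgoff_t index)
  {
      struct page *page = find_get_page(mapping, index);

      if (page && hmm_page_is_device_unaddressable(page)) {
          /*
           * Hypothetical helper: queue the migration on a workqueue
           * and wait for it to complete. No filesystem lock is held
           * here, only a page reference.
           */
          hmm_migrate_back_sync(page);
          put_page(page);
          /* migration replaced the struct page in the mapping */
          page = find_get_page(mapping, index);
      }
      return page;
  }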


> > Second solution is to use a bounce page during I/O so that there is no need
> > for migration.
> 
> Which means the page in the device is left with out-of-date
> contents, right?
>
> If so, how do you prevent data corruption/loss when the device
> has modified the page out of sight of the CPU and the bounce page
> doesn't contain those modifications? Or if the dirty device page is
> written back directly without containing the changes made in the
> bounce page?

There is no issue here: if a bounce page is used then the page is marked read-only
on the device until the write is done and the device copy is updated with what
we have been asked to write. So there is no coherency issue between the two copies.
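
To illustrate (sketch only, every hmm_* name below is invented and error
handling is omitted): a write() that targets a page whose data lives in
un-addressable device memory could go through a bounce page like this, with
the device copy write-protected for the duration so both copies end up
identical:

  static int bounce_write_sketch(struct page *device_page,
                                 const void *buf, size_t len, size_t offset)
  {
      struct page *bounce = alloc_page(GFP_NOIO);
      void *kaddr;

      if (!bounce)
          return -ENOMEM;

      hmm_device_write_protect(device_page);       /* device copy is now RO */
      hmm_dma_device_to_page(bounce, device_page); /* pull current content  */

      kaddr = kmap_atomic(bounce);
      memcpy(kaddr + offset, buf, len);            /* apply the write       */
      kunmap_atomic(kaddr);

      hmm_dma_page_to_device(device_page, bounce); /* push result back      */
      hmm_device_write_unprotect(device_page);

      __free_page(bounce);
      return 0;
  }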

> 
> Hmmm - what happens when we invalidate and release a range of
> file pages that have been migrated to a device? e.g. on truncate?

Same as if it were regular memory: access by the device triggers SIGBUS, which is
reported through the device API. In that respect it follows the exact same
code path as a regular page.

> > > > > > But for thing like writeback i want to be able to do writeback with-
> > > > > > out having to migrate page back first. So that data can stay on the
> > > > > > device while writeback is happening.
> > > > > 
> > > > > Why can't you do writeback before migration, so only clean pages get
> > > > > moved?
> > > > 
> > > > Because device can write to the page while the page is inside the device
> > > > memory and we might want to writeback to disk while page stays in device
> > > > memory and computation continues.
> > > 
> > > Ok. So how does the device trigger ->page_mkwrite on a clean page to
> > > tell the filesystem that the page has been dirtied? So that, for
> > > example, if the page covers a hole because the file is sparse the
> > > filesytem can do the required block allocation and data
> > > initialisation (i.e. zero the cached page) before it gets marked
> > > dirty and any data gets written to it?
> > > 
> > > And if zeroing the page during such a fault requires CPU access to
> > > the data, how do you propose we handle page migration in the middle
> > > of the page fault to allow the CPU to zero the page? Seems like more
> > > lock order/inversion problems there, too...
> > 
> > File back page are never allocated on device, at least we have no incentive
> > for usecase we care about today to do so. So a regular page is first use
> > and initialize (to zero for hole) before being migrated to device.
> > So i do not believe there should be any major concern on ->page_mkwrite.
> 
> Such deja vu - inodes are not static objects as modern filesystems
> are highly dynamic. If you want to have safe, reliable non-coherent
> mmap-based file data offload to devices, then I suspect that we're
> going to need pretty much all of the same restrictions the pmem
> programming model requires for userspace data flushing. i.e.:
> 
> https://lkml.org/lkml/2016/9/15/33

I don't see any of the issues in that email applying to my case. Like I said,
from the fs/mm point of view my pages are _exactly_ like regular pages. The only
thing is no CPU access. So whatever would happen to a regular page happens to a
device page. There is no difference here whatsoever.


> 
> At which point I have to ask: why is mmap considered to be the right
> model for transfering data in and out of devices that are not
> directly CPU addressable? 

That is where the industry is going: OpenCL 2.0/3.0, C++ concurrency and
parallelism, OpenACC, OpenMP, HSA, CUDA ... all those APIs require a unified
address space and transparent use of device memory.

There are hardware solutions in the making, like CCIX or OpenCAPI, but not
all players are willing to move forward and let PCIe go. So we will need
a software solution to cater to those platforms that decide to stick with
PCIe, or otherwise there is a large range of hardware we will not be able
to use to its full potential (rendering it mostly useless on Linux).

 
> > At least
> > this was my impression when i look at generic filemap one, but for some
> > filesystem this might need be problematic.
> 
> Definitely problematic for XFS, btrfs, f2fs, ocfs2, and probably
> ext4 and others as well.
> 
> > and allowing control by userspace to block such
> > migration for given fs.
> 
> How do you propose doing that?

A mount flag option is my first idea, but I have no strong opinion here.
Finer granularity might make sense, but I don't believe so.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 75+ messages in thread


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-13 18:15 ` Jerome Glisse
@ 2016-12-14  3:55   ` Balbir Singh
  -1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2016-12-14  3:55 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 5:15 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> I would like to discuss un-addressable device memory in the context of
> filesystem and block device. Specificaly how to handle write-back, read,
> ... when a filesystem page is migrated to device memory that CPU can not
> access.
>
> I intend to post a patchset leveraging the same idea as the existing
> block bounce helper (block/bounce.c) to handle this. I believe this is
> worth discussing during summit see how people feels about such plan and
> if they have better ideas.
>
>
Yes, that would be interesting. I presume all of this is for ZONE_DEVICE and HMM.
I think designing such an interface requires careful thought on tracking pages
to ensure we don't lose writes, and also on the impact on things like the
writeback subsystem.

From an HMM perspective and an overall MM perspective, I worry that our
accounting system is broken with the proposed mirroring and unaddressable
memory; that needs to be addressed as well.

It would also be nice to have a discussion on migration patches currently on the
list

1. THP migration
2. HMM migration
3. Async migration

> I also like to join discussions on:
>   - Peer-to-Peer DMAs between PCIe devices
>   - CDM coherent device memory

Yes, this needs discussion, specifically whether all of CDM memory is NORMAL
or not, and the special requirements we have today for CDM.

>   - PMEM
>   - overall mm discussions

Balbir Singh.

^ permalink raw reply	[flat|nested] 75+ messages in thread


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-14  1:07                 ` Jerome Glisse
@ 2016-12-14  4:23                   ` Dave Chinner
  -1 siblings, 0 replies; 75+ messages in thread
From: Dave Chinner @ 2016-12-14  4:23 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 08:07:58PM -0500, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > > > From kernel point of view such memory is almost like any other, it
> > > > > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > > > > accessible page.
> > > > > > 
> > > > > > That sounds ... complex. Page migration on page cache access inside
> > > > > > the filesytem IO path locking during read()/write() sounds like
> > > > > > a great way to cause deadlocks....
> > > > > 
> > > > > There are few restriction on device page, no one can do GUP on them and
> > > > > thus no one can pin them. Hence they can always be migrated back. Yes
> > > > > each fs need modification, most of it (if not all) is isolated in common
> > > > > filemap helpers.
> > > > 
> > > > Sure, but you haven't answered my question: how do you propose we
> > > > address the issue of placing all the mm locks required for migration
> > > > under the filesystem IO path locks?
> > > 
> > > Two different plans (which are non exclusive of each other). First is to use
> > > workqueue and have read/write wait on the workqueue to be done migrating the
> > > page back.
> > 
> > Pushing something to a workqueue and then waiting on the workqueue
> > to complete the work doesn't change lock ordering problems - it
> > just hides them away and makes them harder to debug.
> 
> Migration doesn't need many lock below is a list and i don't see any lock issue
> in respect to ->read or ->write.
> 
>  lock_page(page);
>  spin_lock_irq(&mapping->tree_lock);
>  lock_buffer(bh); // if page has buffer_head
>  i_mmap_lock_read(mapping);
>  vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>     // page table lock for each entry
>  }

We can't take the page or mapping tree locks like that while we hold
various filesystem locks.

e.g. The IO path lock order is, in places:

inode->i_rwsem
  get page from page cache
  lock_page(page)
  inode->allocation lock
    zero page data

Filesystems are allowed to do this, because the IO path has
guaranteed them access to the page cache data on the page that is
locked. Your ZONE_DEVICE proposal breaks this guarantee - we might
have a locked page, but we don't have access to its data.

Further, in various filesystems once the allocation lock is taken
(e.g. the i_lock in XFS) we're not allowed to lock pages or the
mapping tree as that leads to deadlocks with truncate, hole punch,
etc. Hence if the "zero page data" operation occurs on a ZONE_DEVICE page that
requires migration before the zeroing can occur, we can't perform
migration here.
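
To spell the ordering problem out (pseudo-code only, not any particular
filesystem's real functions): if the "zero page data" step finds a
ZONE_DEVICE page, the migration sequence you listed would have to run
nested inside the IO path, i.e.:

  inode->i_rwsem
    lock_page(page)                    <- IO path already holds this
      inode->allocation lock
        zero page data
          -> page is ZONE_DEVICE, so migrate first:
               lock_page(page)         <- recursion on the page lock
               spin_lock_irq(&mapping->tree_lock)
                                       <- not allowed under the
                                          allocation lock (truncate/
                                          hole punch lock ordering)

That's the inversion: the migration needs mm locks that must not be taken
underneath the filesystem locks the IO path is already holding at that
point.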

Why are we even considering migration in situations where we already
hold the ZONE_DEVICE page locked, hold other filesystem locks inside
the page lock, and have an open dirty filesystem transaction as well?

Even if migration is possible and succeeds, the struct page in the
mapping tree for the file offset we are operating on is going to be
different after migration. That implies we need to completely
restart the operation. But given that we've already made changes,
backing out at this point is ...  complex and may not even be
possible.

i.e. we have an architectural assumption that page contents are
always accessible when we have a locked struct page, and your
proposal would appear to violate that assumption...

> > > Second solution is to use a bounce page during I/O so that there is no need
> > > for migration.
> > 
> > Which means the page in the device is left with out-of-date
> > contents, right?
> >
> > If so, how do you prevent data corruption/loss when the device
> > has modified the page out of sight of the CPU and the bounce page
> > doesn't contain those modifications? Or if the dirty device page is
> > written back directly without containing the changes made in the
> > bounce page?
> 
> There is no issue here, if bounce page is use then the page is mark as read
> only on the device until write is done and device copy is updated with what
> we have been ask to write. So no coherency issue between the 2 copy.

What if the page is already dirty on the device? You can't just
"mark it read only" because then you lose any data the device had
written that was not directly overwritten by the IO that needed
bouncing.

Partial page overwrites do occur...

> > > > And if zeroing the page during such a fault requires CPU access to
> > > > the data, how do you propose we handle page migration in the middle
> > > > of the page fault to allow the CPU to zero the page? Seems like more
> > > > lock order/inversion problems there, too...
> > > 
> > > File back page are never allocated on device, at least we have no incentive
> > > for usecase we care about today to do so. So a regular page is first use
> > > and initialize (to zero for hole) before being migrated to device.
> > > So i do not believe there should be any major concern on ->page_mkwrite.
> > 
> > Such deja vu - inodes are not static objects as modern filesystems
> > are highly dynamic. If you want to have safe, reliable non-coherent
> > mmap-based file data offload to devices, then I suspect that we're
> > going to need pretty much all of the same restrictions the pmem
> > programming model requires for userspace data flushing. i.e.:
> > 
> > https://lkml.org/lkml/2016/9/15/33
> 
> I don't see any of the issues in that email applying to my case. Like i said
> from fs/mm point of view my page are _exactly_ like regular page.

Except they aren't...

> Only thing
> is no CPU access.

... because filesystems need direct CPU access to the data the page
points at when migration does not appear to be possible.

FWIW, another nasty corner case I just realised: the file data
requires some kind of data transformation on writeback. e.g.
compression, encryption, parity calculations for RAID, etc. IOWs, it
could be the block device underneath the filesystem that requires
ZONE_DEVICE->ZONE_NORMAL migration to occur. And to make matters
worse, that can occur in code paths that operate in a "must
guarantee forwards progress" memory allocation context...

> > At which point I have to ask: why is mmap considered to be the right
> > model for transfering data in and out of devices that are not
> > directly CPU addressable? 
> 
> That is where the industry is going, OpenCL 2.0/3.0, C++ concurrency and
> parallelism, OpenACC, OpenMP, HSA, Cuda ... all those API require unified
> address space and transparent use of device memory.

Sure, but that doesn't mean you can just map random files into the
user address space and then hand it off to random hardware and
expect the filesystem to be perfectly happy with that. 

> > > migration for given fs.
> > 
> > How do you propose doing that?
> 
> As a mount flag option is my first idea but i have no strong opinion here.

No, absolutely not. Mount options are not for controlling random
special interest behaviours in filesystems. That makes it impossible
to mix "incompatible" technologies in the same filesystem.

> It might make sense for finer granularity but i don't believe so.

Then you're just not thinking about complex computation engines the
right way, are you?

e.g. you have a pmem filesystem as the central high-speed data store
for your computation engine. Some apps in the pipeline use DAX for
their data access because it's 10x faster than using traditional
buffered mmap access, so the filesystem is mounted "-o dax". But
then you want to add a hardware accelerator to speed up a different
stage of the pipeline by 10x, but it requires page based ZONE_DEVICE
management.

Unfortunately the "-o zone_device" mount option is incompatible with
"-o dax" and because "it doesn't make sense for DAX to be a fine
grained option" you can't combine the two technologies into the one
pipeline....

That'd really suck, wouldn't it?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 75+ messages in thread


* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-13 21:24         ` Jerome Glisse
@ 2016-12-14 11:13           ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2016-12-14 11:13 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel

On Tue 13-12-16 16:24:33, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > I would like to discuss un-addressable device memory in the context of
> > > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > > access.
> > > > 
> > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > becomes non-DAX capable?
> > > > 
> > > > If you are not talking about pmem and DAX, then exactly what does
> > > > "when a filesystem page is migrated to device memory that CPU can
> > > > not access" mean? What "filesystem page" are we talking about that
> > > > can get migrated from main RAM to something the CPU can't access?
> > > 
> > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > board memory that can not be expose transparently to the CPU. I am
> > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > https://lwn.net/Articles/706856/
> > 
> > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> 
> Well not only target, it can be source too. But the device can read
> and write any system memory and dma to/from that memory to its on
> board memory.
> 
> > 
> > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > "regular" filesystem back by a "regular" block device. I want to be
> > > able to migrate mmaped area of such filesystem to device memory while
> > > the device is actively using that memory.
> > 
> > "migrate mmapped area of such filesystem" means what, exactly?
> 
> fd = open("/path/to/some/file")
> ptr = mmap(fd, ...);
> gpu_compute_something(ptr);
> 
> > 
> > Are you talking about file data contents that have been copied into
> > the page cache and mmapped into a user process address space?
> > IOWs, migrating ZONE_NORMAL page cache page content and state
> > to a new ZONE_DEVICE page, and then migrating back again somehow?
> 
> Take any existing application that mmap a file and allow to migrate
> chunk of that mmaped file to device memory without the application
> even knowing about it. So nothing special in respect to that mmaped
> file. It is a regular file on your filesystem.

OK, so I share most of Dave's concerns about this. But let's talk about
what we can do and what you need and we may find something usable. First
let me understand what is doable / what are the costs on your side.

So we have a page cache page that you'd like to migrate to the device.
Fine. You are willing to sacrifice direct IO - even better. We can fall
back to buffered IO in that case (well, except for XFS which does not do it
but that's a minor detail). One thing I'm not sure about: when a page is
migrated to the device, are its contents still available, just possibly stale,
or will something bad happen if we try to access (or even modify) the page data?

And by migration you really mean page migration? Be aware that migration of
pagecache pages may be a problem for some pages of some filesystems on its
own - e.g. page migration may fail because there is an outstanding filesystem
transaction modifying that page. For userspace these will be sporadic errors
that are really hard to understand, because it is a filesystem-internal thing.
So far page migration has been widely used only for free space defragmentation,
and for that purpose, if a page is not migratable for a minute, who cares.

So won't it be easier to leave the pagecache page where it is and *copy* it
to the device? Can the device notify us *before* it is going to modify a
page, not just after it has modified it? Possibly if we just give it the
page read-only and it has to ask the CPU to get write permission? If yes,
then I believe this could work and even fs support should be doable.
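
Just to make the ordering concrete, a rough sketch (all the names below are
invented for illustration, this is not a real API): the device keeps a
read-only copy, and before its first write it has to call back so that the
filesystem sees the usual clean-to-dirty transition:

  int device_copy_wants_write(struct device_file_copy *copy)
  {
      int ret;

      /* run the equivalent of ->page_mkwrite() on the pagecache page,
       * so the fs can allocate blocks / reserve space / fail with ENOSPC */
      ret = fs_prepare_page_for_write(copy->mapping, copy->index);
      if (ret)
          return ret;

      /* mark the CPU-side pagecache page dirty, as a CPU fault would */
      mark_pagecache_page_dirty(copy->mapping, copy->index);

      /* only now allow the device to write to its copy */
      return device_grant_write_access(copy);
  }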

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread


* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-14  4:23                   ` Dave Chinner
  (?)
@ 2016-12-14 16:35                     ` Jerome Glisse
  -1 siblings, 0 replies; 75+ messages in thread
From: Jerome Glisse @ 2016-12-14 16:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 03:23:13PM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 08:07:58PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > > > > From kernel point of view such memory is almost like any other, it
> > > > > > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > > > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > > > > > accessible page.
> > > > > > > 
> > > > > > > That sounds ... complex. Page migration on page cache access inside
> > > > > > > the filesytem IO path locking during read()/write() sounds like
> > > > > > > a great way to cause deadlocks....
> > > > > > 
> > > > > > There are few restriction on device page, no one can do GUP on them and
> > > > > > thus no one can pin them. Hence they can always be migrated back. Yes
> > > > > > each fs need modification, most of it (if not all) is isolated in common
> > > > > > filemap helpers.
> > > > > 
> > > > > Sure, but you haven't answered my question: how do you propose we
> > > > > address the issue of placing all the mm locks required for migration
> > > > > under the filesystem IO path locks?
> > > > 
> > > > Two different plans (which are non exclusive of each other). First is to use
> > > > workqueue and have read/write wait on the workqueue to be done migrating the
> > > > page back.
> > > 
> > > Pushing something to a workqueue and then waiting on the workqueue
> > > to complete the work doesn't change lock ordering problems - it
> > > just hides them away and makes them harder to debug.
> > 
> > Migration doesn't need many lock below is a list and i don't see any lock issue
> > in respect to ->read or ->write.
> > 
> >  lock_page(page);
> >  spin_lock_irq(&mapping->tree_lock);
> >  lock_buffer(bh); // if page has buffer_head
> >  i_mmap_lock_read(mapping);
> >  vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> >     // page table lock for each entry
> >  }
> 
> We can't take the page or mapping tree locks that while we hold
> various filesystem locks.
> 
> e.g. The IO path lock order is, in places:
> 
> inode->i_rwsem
>   get page from page cache
>   lock_page(page)
>   inode->allocation lock
>     zero page data
> 
> Filesystems are allowed to do this, because the IO path has
> guaranteed them access to the page cache data on the page that is
> locked. Your ZONE_DEVICE proposal breaks this guarantee - we might
> have a locked page, but we don't have access to it's data.
> 
> Further, in various filesystems once the allocation lock is taken
> (e.g. the i_lock in XFS) we're not allowed to lock pages or the
> mapping tree as that leads to deadlocks with truncate, hole punch,
> etc. Hence if the "zero page data" operation occurs on a ZONE_DEVICE page that
> requires migration before the zeroing can occur, we can't perform
> migration here.
> 
> Why are we even considering migration in situations where we already
> hold the ZONE_DEVICE page locked, hold other filesystem locks inside
> the page lock, and have an open dirty filesystem transaction as well?
> 
> Even if migration si possible and succeeds, the struct page in the
> mapping tree for the file offset we are operating on is going to be
> different after migration. That implies we need to completely
> restart the operation. But given that we've already made changes,
> backing out at this point is ...  complex and may not even be
> possible.

So I skimmed through the XFS code and I still think this is doable. So in
the above sequence:

  inode->i_rwsem
  page = find_get_page();
  if (device_unaddressable(page)) {
     page = migratepage();
  }
  ...

Now there are things like filemap_write_and_wait...() but those can be
handled by the bio bounce buffer like I said, i.e. at the block layer we
allocate a temporary page; the page is already read-only on the device as
the device obeys regular things like page_mkclean(). So the page content is
stable.

Page migration uses buffer_migrate_page() and I don't see any deadlock
there. So I am not seeing any problem in doing the migration early, right
after page lookup.
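
To make that concrete, below is a minimal sketch of such an early migration
step. It assumes two hypothetical helpers, is_device_unaddressable() and
hmm_migrate_page_back(), which do not exist today; it only illustrates the
ordering, i.e. migrating right after the pagecache lookup and before any
further filesystem locks are taken:

  /*
   * Illustrative sketch only. is_device_unaddressable() and
   * hmm_migrate_page_back() are hypothetical helpers, not existing API.
   */
  static struct page *get_cpu_addressable_page(struct address_space *mapping,
                                               pgoff_t index)
  {
      struct page *page = find_get_page(mapping, index);

      if (page && is_device_unaddressable(page)) {
          /*
           * May sleep and may return a different struct page; on
           * failure the caller could fall back to a bounce page.
           */
          page = hmm_migrate_page_back(page);
      }
      return page;
  }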


> 
> i.e. we have an architectural assumption that page contents are
> always accessable when we have a locked struct page, and your
> proposal would appear to violate that assumption...

And it is: the data might be in device memory, but you can use a bounce
page to access it, and you can write-protect it on the device so that it
doesn't change.

Looking at XFS, it never does a kmap() directly, only through some of the
generic code, and those are places where we can use a bounce page.


 
> > > > Second solution is to use a bounce page during I/O so that there is no need
> > > > for migration.
> > > 
> > > Which means the page in the device is left with out-of-date
> > > contents, right?
> > >
> > > If so, how do you prevent data corruption/loss when the device
> > > has modified the page out of sight of the CPU and the bounce page
> > > doesn't contain those modifications? Or if the dirty device page is
> > > written back directly without containing the changes made in the
> > > bounce page?
> > 
> > There is no issue here, if bounce page is use then the page is mark as read
> > only on the device until write is done and device copy is updated with what
> > we have been ask to write. So no coherency issue between the 2 copy.
> 
> What if the page is already dirty on the device? You can't just
> "mark it read only" because then you lose any data the device had
> written that was not directly overwritten by the IO that needed
> bouncing.
> 
> Partial page overwrites do occur...

I should have been more explicit. The full sequence would be:
  - write protect page on device
  - alloc bounce page
  - dma device data to bounce page
  - perform write on bounce page
  - dma bounce page back to device data
  - write io end

It is just like it would be on the CPU. There is no data hazard, no loss
of data, and no incoherency here.
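
As a rough sketch of that sequence (hmm_write_protect()/hmm_write_enable()
and device_dma_read()/device_dma_write() are hypothetical helpers, and
error handling is reduced to a minimum):

  /*
   * Illustrative sketch of a (possibly partial) write to data that
   * currently lives in un-addressable device memory, done through a
   * bounce page. All hmm_* and device_dma_* helpers are hypothetical.
   */
  static int bounce_write(struct page *dev_page, const void *buf,
                          unsigned int offset, unsigned int len)
  {
      struct page *bounce = alloc_page(GFP_KERNEL);
      int ret;

      if (!bounce)
          return -ENOMEM;

      hmm_write_protect(dev_page);              /* device copy is now stable */
      ret = device_dma_read(dev_page, bounce);  /* pull current device data  */
      if (!ret) {
          memcpy(page_address(bounce) + offset, buf, len);
          ret = device_dma_write(bounce, dev_page); /* push merged data back */
      }
      hmm_write_enable(dev_page);               /* write io end */
      __free_page(bounce);
      return ret;
  }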

> > > > > And if zeroing the page during such a fault requires CPU access to
> > > > > the data, how do you propose we handle page migration in the middle
> > > > > of the page fault to allow the CPU to zero the page? Seems like more
> > > > > lock order/inversion problems there, too...
> > > > 
> > > > File back page are never allocated on device, at least we have no incentive
> > > > for usecase we care about today to do so. So a regular page is first use
> > > > and initialize (to zero for hole) before being migrated to device.
> > > > So i do not believe there should be any major concern on ->page_mkwrite.
> > > 
> > > Such deja vu - inodes are not static objects as modern filesystems
> > > are highly dynamic. If you want to have safe, reliable non-coherent
> > > mmap-based file data offload to devices, then I suspect that we're
> > > going to need pretty much all of the same restrictions the pmem
> > > programming model requires for userspace data flushing. i.e.:
> > > 
> > > https://lkml.org/lkml/2016/9/15/33
> > 
> > I don't see any of the issues in that email applying to my case. Like i said
> > from fs/mm point of view my page are _exactly_ like regular page.
> 
> Except they aren't...
> 
> > Only thing
> > is no CPU access.
> 
> ... because filesystems need direct CPU access to the data the page
> points at when migration does not appear to be possible.

And it can: the data is always accessible, it is just a matter of using
a bounce page. I did a grep on kmap() and 99% of the call sites are about
metadata pages, which I don't want to migrate. Then there are some in the
generic helpers for read/write/aio ... these are places where a bounce page
can be used if the page is not migrated earlier in the I/O process.

> 
> FWIW, another nasty corner case I just realised: the file data
> requires some kind of data transformation on writeback. e.g.
> compression, encryption, parity calculations for RAID, etc. IOWs, it
> could be the block device underneath the filesystem that requires
> ZONE_DEVICE->ZONE_NORMAL migration to occur. And to make matters
> worse, that can occur in code paths that operate in a "must
> guarantee forwards progress" memory allocation context...

Well, my proposal is about using the bio bounce code, which was done for
ISA block devices, and I don't see any issue there. We allocate a bounce
page and copy the data from the device into the bounce page; the block
layer does its thing (compress, encrypt, ...) on the bounce page and is
none the wiser. There is no migration happening. Note that at this point
the page is already write-protected on the device, like it would be on the
CPU.
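
A rough sketch of the submission side, loosely modelled on block/bounce.c
(is_device_unaddressable() and device_dma_read() are hypothetical; a real
implementation would use a mempool for forward-progress guarantees and
would copy back / free the bounce pages at bio completion, just like the
ISA bounce code does):

  static void bounce_unaddressable_bio(struct bio *bio)
  {
      struct bio_vec *bvec;
      int i;

      bio_for_each_segment_all(bvec, bio, i) {
          struct page *bounce;

          if (!is_device_unaddressable(bvec->bv_page))
              continue;

          /* mempool allocation + failure handling omitted for brevity */
          bounce = alloc_page(GFP_NOIO);
          /* device copy was already write protected (page_mkclean()) */
          device_dma_read(bvec->bv_page, bounce);
          bvec->bv_page = bounce;
      }
  }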


> > > At which point I have to ask: why is mmap considered to be the right
> > > model for transfering data in and out of devices that are not
> > > directly CPU addressable? 
> > 
> > That is where the industry is going, OpenCL 2.0/3.0, C++ concurrency and
> > parallelism, OpenACC, OpenMP, HSA, Cuda ... all those API require unified
> > address space and transparent use of device memory.
> 
> Sure, but that doesn't mean you can just map random files into the
> user address space and then hand it off to random hardware and
> expect the filesystem to be perfectly happy with that. 

I am not expecting the filesystem will be happy as it is, but I am
expecting there is a way to make it happy :)


> > > > migration for given fs.
> > > 
> > > How do you propose doing that?
> > 
> > As a mount flag option is my first idea but i have no strong opinion here.
> 
> No, absolutely not. Mount options are not for controlling random
> special interest behaviours in filesystems. That makes it impossible
> to mix "incompatible" technologies in the same filesystem.

I don't have a strong opinion here. I would just like to allow the
sysadmin to decide somehow whether a given fs may be migrated to the
device. I don't have good knowledge of what interface would be
appropriate for this.

> 
> > It might make sense for finer granularity but i don't believe so.
> 
> Then you're just not thinking about complex computation engines the
> right way, are you?
> 
> e.g. you have a pmem filesystem as the central high-speed data store
> for you computation engine. Some apps in the pipeline use DAX for
> their data access because it's 10x faster than using traditional
> buffered mmap access, so the filesystem is mounted "-o dax". But
> then you want to add a hardware accelerator to speed up a different
> stage of the pipeline by 10x, but it requires page based ZONE_DEVICE
> management.
> 
> Unfortuantely the "-o zone_device" mount option is incompatible with
> "-o dax" and because "it doesn't make sense for DAX to be a fine
> grained option" you can't combine the two technologies into the one
> pipeline....
> 
> That'd really suck, wouldn't it?

Well, I don't want to allow migration for DAX filesystems because DAX is a
different problem. I think it is only used with pmem, and I don't think I
want to allow pmem migration; it would break some assumptions people have
about pmem. People using both technologies would have to do extra work in
their programs to leverage both.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-14 11:13           ` Jan Kara
  (?)
@ 2016-12-14 17:15             ` Jerome Glisse
  -1 siblings, 0 replies; 75+ messages in thread
From: Jerome Glisse @ 2016-12-14 17:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel

On Wed, Dec 14, 2016 at 12:13:51PM +0100, Jan Kara wrote:
> On Tue 13-12-16 16:24:33, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > > I would like to discuss un-addressable device memory in the context of
> > > > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > > > access.
> > > > > 
> > > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > > becomes non-DAX capable?
> > > > > 
> > > > > If you are not talking about pmem and DAX, then exactly what does
> > > > > "when a filesystem page is migrated to device memory that CPU can
> > > > > not access" mean? What "filesystem page" are we talking about that
> > > > > can get migrated from main RAM to something the CPU can't access?
> > > > 
> > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > > board memory that can not be expose transparently to the CPU. I am
> > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > > https://lwn.net/Articles/706856/
> > > 
> > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> > 
> > Well not only target, it can be source too. But the device can read
> > and write any system memory and dma to/from that memory to its on
> > board memory.
> > 
> > > 
> > > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > > "regular" filesystem back by a "regular" block device. I want to be
> > > > able to migrate mmaped area of such filesystem to device memory while
> > > > the device is actively using that memory.
> > > 
> > > "migrate mmapped area of such filesystem" means what, exactly?
> > 
> > fd = open("/path/to/some/file")
> > ptr = mmap(fd, ...);
> > gpu_compute_something(ptr);
> > 
> > > 
> > > Are you talking about file data contents that have been copied into
> > > the page cache and mmapped into a user process address space?
> > > IOWs, migrating ZONE_NORMAL page cache page content and state
> > > to a new ZONE_DEVICE page, and then migrating back again somehow?
> > 
> > Take any existing application that mmap a file and allow to migrate
> > chunk of that mmaped file to device memory without the application
> > even knowing about it. So nothing special in respect to that mmaped
> > file. It is a regular file on your filesystem.
> 
> OK, so I share most of Dave's concerns about this. But let's talk about
> what we can do and what you need and we may find something usable. First
> let me understand what is doable / what are the costs on your side.
> 
> So we have a page cache page that you'd like to migrate to the device.
> Fine. You are willing to sacrifice direct IO - even better. We can fall
> back to buffered IO in that case (well, except for XFS which does not do it
> but that's a minor detail). One thing I'm not sure about: When a page is
> migrated to the device, is its contents available and is just possibly stale
> or will something bad happen if we try to access (or even modify) page data?

Well, I am not ready to sacrifice anything :) The point is that high-level
languages are evolving in a direction where they want to transparently use
devices like GPUs without the programmer's knowledge, so it is important
that all features keep working as if nothing is amiss.

Devices behave exactly like CPUs with respect to memory. They have a page
table and the same kind of capabilities, so a device will follow the same
rules. When you start writeback you do page_mkclean() and this is reflected
on the device too: it will write-protect the page.

Moreover, you can access the data at any time. Devices are cache coherent,
so when you use their DMA engine to retrieve the page content you will get
the full page content and nothing can be stale (assuming the page is first
write-protected).
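
In other words the driver mirrors the CPU page tables into the device page
table, so CPU-side events such as page_mkclean() end up write-protecting or
invalidating the device mapping too. Very roughly, with struct dev_mirror
and dev_pagetable_invalidate() being hypothetical driver-side pieces (the
real HMM mirroring has more callbacks than this):

  static void dev_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end)
  {
      struct dev_mirror *mirror = container_of(mn, struct dev_mirror, mn);

      /* drop or write protect the device PTEs covering [start, end) */
      dev_pagetable_invalidate(mirror, start, end);
  }

  static const struct mmu_notifier_ops dev_mmu_notifier_ops = {
      .invalidate_range_start = dev_invalidate_range_start,
  };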

> 
> And by migration you really mean page migration? Be aware that migration of
> pagecache pages may be a problem for some pages of some filesystems on its
> own - e. g. page migration may fail because there is a filesystem transaction
> outstanding modifying that page. For userspace these will be really hard
> to understand sporadic errors because it's really filesystem internal
> thing. So far page migration was widely used only for free space
> defragmentation and for that purpose if page is not migratable for a minute
> who cares.

I am aware that page migration can fail because writeback is underway, and
I am fine with that. When that happens the device either waits or uses the
system page directly (read-only, obviously, as the device obeys read/write
protection).

> 
> So won't it be easier to leave the pagecache page where it is and *copy* it
> to the device? Can the device notify us *before* it is going to modify a
> page, not just after it has modified it? Possibly if we just give it the
> page read-only and it will have to ask CPU to get write permission? If yes,
> then I belive this could work and even fs support should be doable.

Well, yes and no. The device obeys the same rules as the CPU, so if a
file-backed page is mapped read-only in the process, it must first do a
write fault which will call into the fs (page_mkwrite() of vm_ops). But
once a page has write permission there is no way to be notified by the
hardware on every write. First, the hardware does not have the capability.
Second, we are talking about thousands (10 000 is the upper range in
today's devices) of concurrent threads, each of which can possibly write to
the page under consideration.
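
A rough sketch of what the driver does on such a device write fault
(dev_pagetable_update() is hypothetical, and the real code needs retry
loops, error handling and locking around it):

  /* called with current->mm->mmap_sem held for read */
  static int dev_handle_write_fault(struct mm_struct *mm, unsigned long addr)
  {
      struct vm_area_struct *vma = find_vma(mm, addr);
      int ret;

      if (!vma || addr < vma->vm_start)
          return -EFAULT;

      /*
       * Goes through the normal fault path, so a read-only file-backed
       * page hits the filesystem's ->page_mkwrite() exactly as a CPU
       * write fault would.
       */
      ret = handle_mm_fault(vma, addr, FAULT_FLAG_WRITE);
      if (ret & VM_FAULT_ERROR)
          return -EFAULT;

      /* now refill the device page table from the CPU page table */
      return dev_pagetable_update(vma, addr);
  }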

We really want the device page to behave just like a regular page. Most fs
code paths never map file content; it only happens during read/write, and I
believe this can be handled either by migrating back or by using a bounce
page. I want to provide the choice between the two solutions, as one will
be better for some workloads and the other for others.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-14 17:15             ` Jerome Glisse
@ 2016-12-15 16:19               ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2016-12-15 16:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel

On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
<snipped explanation that the device has the same capabilities as CPUs wrt
page handling>

> > So won't it be easier to leave the pagecache page where it is and *copy* it
> > to the device? Can the device notify us *before* it is going to modify a
> > page, not just after it has modified it? Possibly if we just give it the
> > page read-only and it will have to ask CPU to get write permission? If yes,
> > then I belive this could work and even fs support should be doable.
> 
> Well yes and no. Device obey the same rule as CPU so if a file back page is
> map read only in the process it must first do a write fault which will call
> in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> there is no way to be notify by hardware on every write. First the hardware
> do not have the capability. Second we are talking thousand (10 000 is upper
> range in today device) of concurrent thread, each can possibly write to page
> under consideration.

Sure, I meant whether the device is able to do the equivalent of a
->page_mkwrite notification, which apparently it is. OK.

> We really want the device page to behave just like regular page. Most fs code
> path never map file content, it only happens during read/write and i believe
> this can be handled either by migrating back or by using bounce page. I want
> to provide the choice between the two solutions as one will be better for some
> workload and the other for different workload.

I agree with keeping a page used by the device behaving as similarly as
possible to any other page. I'm just exploring different possibilities for
how to make that happen. E.g. the scheme I was aiming at is:

When you want page A to be used by the device, you set up page A' in the
device but make sure any access to it will fault.

When the device wants to access A', it notifies the CPU, which writeprotects
all mappings of A, copies A to A' and maps A' read-only for the device.

When the device wants to write to A', it notifies the CPU, which will clear
all mappings of A and mark A as not-uptodate & dirty. When the CPU then
wants to access the data in A again - we need to catch ->readpage,
->readpages, ->writepage, ->writepages - it will writeprotect A' in
the device, copy the data to A, mark A as uptodate & dirty, and off we go.

When we want to write to the page on the CPU - we get either a wp fault if
it was mapped via mmap, or we have to catch that in places using kmap() -
we just remove access to A' from the device.

This scheme makes the device mapping functionality transparent to the
filesystem (you actually don't need to hook directly into the ->readpage etc.
handlers, you can just have wrappers around them for this functionality) and
is fairly straightforward... It is so transparent that even direct IO works
with this, since the page cache invalidation pass we do before actually doing
the direct IO will make sure to pull all the pages from the device and write
them to disk if needed. What do you think?
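
To illustrate the wrapper idea, here is a minimal sketch of what a wrapped
->readpage could look like. Everything device-side in it is hypothetical -
real_aops(), struct dev_page, dev_page_lookup(), dev_page_writeprotect() and
dev_page_copy_to() are only placeholders for the notification/copy steps
described above, not existing kernel interfaces:

/*
 * Illustrative only: if this file index has a device copy A', pull the
 * data back into the pagecache page A instead of reading from disk.
 */
static int wrapped_readpage(struct file *file, struct page *page)
{
	struct address_space *mapping = page->mapping;
	/* hypothetical: original fs a_ops stashed when the wrapping was set up */
	const struct address_space_operations *fs_aops = real_aops(mapping);
	/* hypothetical: does this index have a device-side copy A'? */
	struct dev_page *dpage = dev_page_lookup(mapping, page->index);

	if (dpage) {
		dev_page_writeprotect(dpage);	/* device loses write access to A' */
		dev_page_copy_to(dpage, page);	/* DMA A' back into page A */
		SetPageUptodate(page);
		set_page_dirty(page);		/* A' may be newer than disk */
		unlock_page(page);
		return 0;
	}
	/* no device copy: read from disk as usual */
	return fs_aops->readpage(file, page);
}

The ->writepage wrapper would be the symmetric operation: pull the data from
A' into A first, then let the filesystem's real ->writepage write A out.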

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-15 16:19               ` Jan Kara
  (?)
@ 2016-12-15 19:14                 ` Jerome Glisse
  -1 siblings, 0 replies; 75+ messages in thread
From: Jerome Glisse @ 2016-12-15 19:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel

On Thu, Dec 15, 2016 at 05:19:39PM +0100, Jan Kara wrote:
> On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> <snipped explanation that the device has the same cabilities as CPUs wrt
> page handling>
> 
> > > So won't it be easier to leave the pagecache page where it is and *copy* it
> > > to the device? Can the device notify us *before* it is going to modify a
> > > page, not just after it has modified it? Possibly if we just give it the
> > > page read-only and it will have to ask CPU to get write permission? If yes,
> > > then I belive this could work and even fs support should be doable.
> > 
> > Well yes and no. Device obey the same rule as CPU so if a file back page is
> > map read only in the process it must first do a write fault which will call
> > in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> > there is no way to be notify by hardware on every write. First the hardware
> > do not have the capability. Second we are talking thousand (10 000 is upper
> > range in today device) of concurrent thread, each can possibly write to page
> > under consideration.
> 
> Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> notification which apparently it is. OK.
> 
> > We really want the device page to behave just like regular page. Most fs code
> > path never map file content, it only happens during read/write and i believe
> > this can be handled either by migrating back or by using bounce page. I want
> > to provide the choice between the two solutions as one will be better for some
> > workload and the other for different workload.
> 
> I agree with keeping page used by the device behaving as similar as
> possible as any other page. I'm just exploring different possibilities how
> to make that happen. E.g. the scheme I was aiming at is:
> 
> When you want page A to be used by the device, you set up page A' in the
> device but make sure any access to it will fault.
> 
> When the device wants to access A', it notifies the CPU, that writeprotects
> all mappings of A, copy A to A' and map A' read-only for the device.
> 
> When the device wants to write to A', it notifies CPU, that will clear all
> mappings of A and mark A as not-uptodate & dirty. When the CPU will then
> want to access the data in A again - we need to catch ->readpage,
> ->readpages, ->writepage, ->writepages - it will writeprotect A' in
> the device, copy data to A, mark A as uptodate & dirty, and off we go.
> 
> When we want to write to the page on CPU - we get either wp fault if it was
> via mmap, or we have to catch that in places using kmap() - we just remove
> access to A' from the device.
> 
> This scheme makes the device mapping functionality transparent to the
> filesystem (you actually don't need to hook directly into ->readpage etc.
> handlers, you can just have wrappers around them for this functionality)
> and fairly straightforward... It is so transparent that even direct IO works
> with this since the page cache invalidation pass we do before actually doing
> the direct IO will make sure to pull all the pages from the device and write
> them to disk if needed. What do you think?

This is doable, but I think it will require the same amount of changes as
what I had in mind (excluding the block bounce code), with one drawback: doing
it that way we can not free page A.

On some workloads this probably does not hurt much, but on a workload where
you read a big dataset from disk and then use it only on the GPU for a long
period of time (minutes/hours) you will waste GBs of system memory.

Right now I am working on some other patchset; I intend to take a stab at this
in the January/February time frame, before the summit, so I can post an RFC
and have a clear picture of every code path that needs modifications. I expect
this would provide a better frame for discussion.

I assume I will have to change ->readpage, ->readpages, ->writepage and
->writepages, but I think that the only places I really need to change are
do_generic_file_read() and generic_perform_write() (or iov_iter_copy_*). Of
course this only applies to fs that use those generic helpers.
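
As a rough sketch of the bounce idea in the read path - hmm_page_is_device()
and hmm_bounce_read() are made-up names standing in for "is this pagecache
entry an un-addressable device page" and "DMA its content into a system
page", not existing kernel APIs:

/*
 * Illustrative only: give do_generic_file_read() something the CPU can
 * actually copy from when the pagecache slot holds a device page.
 */
static struct page *get_readable_page(struct page *page)
{
	struct page *bounce;

	if (!hmm_page_is_device(page))	/* hypothetical test */
		return page;		/* regular page, CPU can read it */

	bounce = alloc_page(GFP_KERNEL);
	if (!bounce)
		return ERR_PTR(-ENOMEM);
	hmm_bounce_read(page, bounce);	/* hypothetical: DMA A' -> bounce */
	return bounce;
}

The read loop would then feed the returned page to copy_page_to_iter() and
free the bounce page afterwards, while the device page stays where it is;
generic_perform_write() would need the symmetric bounce-on-write step.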

I will also probably change ->mmap, or rather the helper it uses to set the
pte, depending on what looks better.

Note that I don't think wrapping is an easy task. I would need to replace page
A's mapping (struct page.mapping) to point to a wrapping address_space, but
there are enough places in the kernel that directly dereference that and
expect to hit the right (real) address_space. I would need to replace all
dereferences of page->mapping with a helper function and possibly would need
to change some of the call-site logic accordingly. This might prove a bigger
change than just having to use bounce pages in do_generic_file_read() and
generic_perform_write().
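
For instance, every direct dereference would have to become something like
the helper below (the wrapping_aops address_space_operations and the
wrapping_to_real() backpointer are hypothetical):

/*
 * Illustrative only: what replacing direct page->mapping dereferences
 * with a helper could look like.
 */
static inline struct address_space *page_real_mapping(struct page *page)
{
	struct address_space *mapping = page->mapping;

	if (mapping && mapping->a_ops == &wrapping_aops)
		return wrapping_to_real(mapping);	/* hypothetical backpointer */
	return mapping;
}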

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-15 16:19               ` Jan Kara
  (?)
@ 2016-12-16  3:10                 ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 75+ messages in thread
From: Aneesh Kumar K.V @ 2016-12-16  3:10 UTC (permalink / raw)
  To: Jan Kara, Jerome Glisse
  Cc: Jan Kara, Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel

Jan Kara <jack@suse.cz> writes:

> On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> <snipped explanation that the device has the same cabilities as CPUs wrt
> page handling>
>
>> > So won't it be easier to leave the pagecache page where it is and *copy* it
>> > to the device? Can the device notify us *before* it is going to modify a
>> > page, not just after it has modified it? Possibly if we just give it the
>> > page read-only and it will have to ask CPU to get write permission? If yes,
>> > then I belive this could work and even fs support should be doable.
>> 
>> Well yes and no. Device obey the same rule as CPU so if a file back page is
>> map read only in the process it must first do a write fault which will call
>> in the fs (page_mkwrite() of vm_ops). But once a page has write permission
>> there is no way to be notify by hardware on every write. First the hardware
>> do not have the capability. Second we are talking thousand (10 000 is upper
>> range in today device) of concurrent thread, each can possibly write to page
>> under consideration.
>
> Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> notification which apparently it is. OK.
>
>> We really want the device page to behave just like regular page. Most fs code
>> path never map file content, it only happens during read/write and i believe
>> this can be handled either by migrating back or by using bounce page. I want
>> to provide the choice between the two solutions as one will be better for some
>> workload and the other for different workload.
>
> I agree with keeping page used by the device behaving as similar as
> possible as any other page. I'm just exploring different possibilities how
> to make that happen. E.g. the scheme I was aiming at is:
>
> When you want page A to be used by the device, you set up page A' in the
> device but make sure any access to it will fault.
>
> When the device wants to access A', it notifies the CPU, that writeprotects
> all mappings of A, copy A to A' and map A' read-only for the device.


A and A' will have different pfns here and hence different struct pages.
So what will be in the address_space->page_tree? If we place A' in the page
cache, then we are essentially bringing in a lot of the locking complexity
Dave talked about in previous mails.

>
> When the device wants to write to A', it notifies CPU, that will clear all
> mappings of A and mark A as not-uptodate & dirty. When the CPU will then
> want to access the data in A again - we need to catch ->readpage,
> ->readpages, ->writepage, ->writepages - it will writeprotect A' in
> the device, copy data to A, mark A as uptodate & dirty, and off we go.
>
> When we want to write to the page on CPU - we get either wp fault if it was
> via mmap, or we have to catch that in places using kmap() - we just remove
> access to A' from the device.
>
> This scheme makes the device mapping functionality transparent to the
> filesystem (you actually don't need to hook directly into ->readpage etc.
> handlers, you can just have wrappers around them for this functionality)
> and fairly straightforward... It is so transparent that even direct IO works
> with this since the page cache invalidation pass we do before actually doing
> the direct IO will make sure to pull all the pages from the device and write
> them to disk if needed. What do you think?
>

-aneesh


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [LSF/MM ATTEND] Un-addressable device memory and block/fs implications
  2016-12-13 18:15 ` Jerome Glisse
@ 2016-12-16  3:14   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 75+ messages in thread
From: Aneesh Kumar K.V @ 2016-12-16  3:14 UTC (permalink / raw)
  To: Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel

Jerome Glisse <jglisse@redhat.com> writes:

> I would like to discuss un-addressable device memory in the context of
> filesystem and block device. Specificaly how to handle write-back, read,
> ... when a filesystem page is migrated to device memory that CPU can not
> access.
>
> I intend to post a patchset leveraging the same idea as the existing
> block bounce helper (block/bounce.c) to handle this. I believe this is
> worth discussing during summit see how people feels about such plan and
> if they have better ideas.
>
>
> I also like to join discussions on:
>   - Peer-to-Peer DMAs between PCIe devices
>   - CDM coherent device memory
>   - PMEM
>   - overall mm discussions

I would like to attend this discussion. I can talk about coherent device
memory and how having HMM handle it will make it easy to have one
interface for device drivers. For the coherent device case we definitely need
page cache migration support.

-aneesh


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-15 19:14                 ` Jerome Glisse
@ 2016-12-16  8:14                   ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2016-12-16  8:14 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel

On Thu 15-12-16 14:14:53, Jerome Glisse wrote:
> On Thu, Dec 15, 2016 at 05:19:39PM +0100, Jan Kara wrote:
> > On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> > <snipped explanation that the device has the same cabilities as CPUs wrt
> > page handling>
> > 
> > > > So won't it be easier to leave the pagecache page where it is and *copy* it
> > > > to the device? Can the device notify us *before* it is going to modify a
> > > > page, not just after it has modified it? Possibly if we just give it the
> > > > page read-only and it will have to ask CPU to get write permission? If yes,
> > > > then I belive this could work and even fs support should be doable.
> > > 
> > > Well yes and no. Device obey the same rule as CPU so if a file back page is
> > > map read only in the process it must first do a write fault which will call
> > > in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> > > there is no way to be notify by hardware on every write. First the hardware
> > > do not have the capability. Second we are talking thousand (10 000 is upper
> > > range in today device) of concurrent thread, each can possibly write to page
> > > under consideration.
> > 
> > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> > notification which apparently it is. OK.
> > 
> > > We really want the device page to behave just like regular page. Most fs code
> > > path never map file content, it only happens during read/write and i believe
> > > this can be handled either by migrating back or by using bounce page. I want
> > > to provide the choice between the two solutions as one will be better for some
> > > workload and the other for different workload.
> > 
> > I agree with keeping page used by the device behaving as similar as
> > possible as any other page. I'm just exploring different possibilities how
> > to make that happen. E.g. the scheme I was aiming at is:
> > 
> > When you want page A to be used by the device, you set up page A' in the
> > device but make sure any access to it will fault.
> > 
> > When the device wants to access A', it notifies the CPU, that writeprotects
> > all mappings of A, copy A to A' and map A' read-only for the device.
> > 
> > When the device wants to write to A', it notifies CPU, that will clear all
> > mappings of A and mark A as not-uptodate & dirty. When the CPU will then
> > want to access the data in A again - we need to catch ->readpage,
> > ->readpages, ->writepage, ->writepages - it will writeprotect A' in
> > the device, copy data to A, mark A as uptodate & dirty, and off we go.
> > 
> > When we want to write to the page on CPU - we get either wp fault if it was
> > via mmap, or we have to catch that in places using kmap() - we just remove
> > access to A' from the device.
> > 
> > This scheme makes the device mapping functionality transparent to the
> > filesystem (you actually don't need to hook directly into ->readpage etc.
> > handlers, you can just have wrappers around them for this functionality)
> > and fairly straightforward... It is so transparent that even direct IO works
> > with this since the page cache invalidation pass we do before actually doing
> > the direct IO will make sure to pull all the pages from the device and write
> > them to disk if needed. What do you think?
> 
> This is do-able but i think it will require the same amount of changes than
> what i had in mind (excluding the block bounce code) with one drawback. Doing
> it that way we can not free page A.

I guess I'd have to see code implementing your approach to be able to judge
what ends up being less code - the devil is in the details here, I believe.
Actually, when thinking about it with a fresh mind, I don't think we'd have
to catch kmap() at all with my approach - all writes could be caught either
in grab_cache_page_write_begin() or in page_mkwrite(). What I like about my
solution is that it is completely fs agnostic, and the places that need
handling of device pages have very relaxed locking constraints - grabbing the
locks necessary to update mappings / communicate with the device should be a
no-brainer in those contexts.

> On some workload this probably does not hurt much but on workload where you
> read a big dataset from disk and then use it only on the GPU for long period
> of time (minutes/hours) you will waste GB of system memory.

I was thinking about this as well. So you could just leave page A to undergo
normal page aging and reclaim. However, what you need is to somehow maintain
the information that index I in file F is mapped to the device's page A', so
that ->readpage() and friends know they should pull the page from the device
and not from disk. Traditionally we do this with exceptional entries in the
radix tree - i.e., when we reclaim A, we do not insert a shadow exceptional
entry into the radix tree telling when the page was evicted, but instead
insert an exceptional entry telling that this page is stored in the device.
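
Roughly, such an entry could look like the sketch below. The flag value and
helpers are made up for illustration; only RADIX_TREE_EXCEPTIONAL_ENTRY and
radix_tree_exceptional_entry() are the existing radix tree bits that shadow
and DAX entries already use:

/*
 * Illustrative only: encode "this file index lives in device page A'"
 * as an exceptional radix tree entry so that reclaiming A does not lose
 * the information.
 */
#define DEVICE_ENTRY_FLAG	0x4	/* made-up bit, next to the exceptional bit */

static inline void *make_device_entry(unsigned long device_pfn)
{
	return (void *)((device_pfn << 3) | DEVICE_ENTRY_FLAG |
			RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static inline bool is_device_entry(void *entry)
{
	return radix_tree_exceptional_entry(entry) &&
	       ((unsigned long)entry & DEVICE_ENTRY_FLAG);
}

->readpage() and friends would then, on finding such an entry instead of a
page, allocate a fresh page A and pull the data from the device rather than
from disk.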

> Right now i am working on some other patchset, i intend to take a stab at this
> in January/February time frame, before summit so i can post an RFC and have a
> clear picture of every code path that needs modifications. I expect this would
> provide better frame for discussion.

Yeah, that sounds good.

> I assume i will have to change >readpage >readpages writepage >writepages but
> i think that the only place i really need to change are do_generic_file_read()
> and generic_perform_write() (or iov_iter_copy_*). Of course this only apply to
> fs that use those generic helpers.

Not really. There is other stuff that can pull pagecache pages into
memory - e.g. think of readahead, or page faults, or the fault-around
logic, or splice, or ...
 
> I also probably will change >mmap or rather the helper it uses to set the pte
> depending on what looks better.
> 
> Note that i don't think wrapping is an easy task. I would need to replace page
> A mapping (struct page.mapping) to point to a wrapping address_space but there
> is enough place in the kernel that directly dereference that and expect to hit
> the right (real) address_space. I would need to replace all dereference of
> page->mapping to an helper function and possibly would need to change some of
> the call site logic accordingly. This might prove a bigger change than just
> having to use bounce in do_generic_file_read() and generic_perform_write().

So what I meant by wrapping is that you'd wrap the places that call
->readpage, ->readpages, ->writepage and ->writepages with a helper function
that does what you need.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-16  3:10                 ` Aneesh Kumar K.V
@ 2016-12-19  8:46                   ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2016-12-19  8:46 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, Jerome Glisse, Dave Chinner, linux-block, linux-mm,
	lsf-pc, linux-fsdevel

On Fri 16-12-16 08:40:38, Aneesh Kumar K.V wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> > <snipped explanation that the device has the same cabilities as CPUs wrt
> > page handling>
> >
> >> > So won't it be easier to leave the pagecache page where it is and *copy* it
> >> > to the device? Can the device notify us *before* it is going to modify a
> >> > page, not just after it has modified it? Possibly if we just give it the
> >> > page read-only and it will have to ask CPU to get write permission? If yes,
> >> > then I belive this could work and even fs support should be doable.
> >> 
> >> Well yes and no. Device obey the same rule as CPU so if a file back page is
> >> map read only in the process it must first do a write fault which will call
> >> in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> >> there is no way to be notify by hardware on every write. First the hardware
> >> do not have the capability. Second we are talking thousand (10 000 is upper
> >> range in today device) of concurrent thread, each can possibly write to page
> >> under consideration.
> >
> > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> > notification which apparently it is. OK.
> >
> >> We really want the device page to behave just like regular page. Most fs code
> >> path never map file content, it only happens during read/write and i believe
> >> this can be handled either by migrating back or by using bounce page. I want
> >> to provide the choice between the two solutions as one will be better for some
> >> workload and the other for different workload.
> >
> > I agree with keeping page used by the device behaving as similar as
> > possible as any other page. I'm just exploring different possibilities how
> > to make that happen. E.g. the scheme I was aiming at is:
> >
> > When you want page A to be used by the device, you set up page A' in the
> > device but make sure any access to it will fault.
> >
> > When the device wants to access A', it notifies the CPU, that writeprotects
> > all mappings of A, copy A to A' and map A' read-only for the device.
> 
> 
> A and A' will have different pfns here and hence different struct page.

Yes. In fact I don't think there's a need to have a struct page for A' in my
scheme, at least for the purposes of page cache tracking... Maybe there's a
good reason to have it from a device driver POV.

> So what will be there in the address_space->page_tree ? If we place
> A' in the page cache, then we are essentially bringing lot of locking
> complexity Dave talked about in previous mails.

No, I meant that page A will stay in the page_tree. There's no need for
migration in my scheme.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
  2016-12-14 11:13           ` Jan Kara
@ 2016-12-19 17:00             ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 75+ messages in thread
From: Aneesh Kumar K.V @ 2016-12-19 17:00 UTC (permalink / raw)
  To: Jan Kara, Jerome Glisse
  Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel

Jan Kara <jack@suse.cz> writes:

> On Tue 13-12-16 16:24:33, Jerome Glisse wrote:
>> On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
>> > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
>> > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
>> > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
>> > > > > I would like to discuss un-addressable device memory in the context of
>> > > > > filesystem and block device. Specificaly how to handle write-back, read,
>> > > > > ... when a filesystem page is migrated to device memory that CPU can not
>> > > > > access.
>> > > > 
>> > > > You mean pmem that is DAX-capable that suddenly, without warning,
>> > > > becomes non-DAX capable?
>> > > > 
>> > > > If you are not talking about pmem and DAX, then exactly what does
>> > > > "when a filesystem page is migrated to device memory that CPU can
>> > > > not access" mean? What "filesystem page" are we talking about that
>> > > > can get migrated from main RAM to something the CPU can't access?
>> > > 
>> > > I am talking about GPUs, FPGAs, ... any PCIe device that has fast
>> > > on-board memory that cannot be exposed transparently to the CPU. I am
>> > > reusing ZONE_DEVICE for this; you can see the HMM patchset on linux-mm:
>> > > https://lwn.net/Articles/706856/
>> > 
>> > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
>> 
>> Well, not only a target, it can be a source too. The device can read
>> and write any system memory and DMA to/from that memory to its on-board
>> memory.
>> 
>> > 
>> > > So in my case I am only considering non-DAX/PMEM filesystems, i.e. any
>> > > "regular" filesystem backed by a "regular" block device. I want to be
>> > > able to migrate mmapped areas of such a filesystem to device memory while
>> > > the device is actively using that memory.
>> > 
>> > "migrate mmapped area of such filesystem" means what, exactly?
>> 
>> fd = open("/path/to/some/file", O_RDWR);
>> ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> gpu_compute_something(ptr);
>> 
>> > 
>> > Are you talking about file data contents that have been copied into
>> > the page cache and mmapped into a user process address space?
>> > IOWs, migrating ZONE_NORMAL page cache page content and state
>> > to a new ZONE_DEVICE page, and then migrating back again somehow?
>> 
>> Take any existing application that mmaps a file, and allow chunks of that
>> mmapped file to be migrated to device memory without the application even
>> knowing about it. So there is nothing special about that mmapped file; it
>> is a regular file on your filesystem.
>
> OK, so I share most of Dave's concerns about this. But let's talk about
> what we can do and what you need, and we may find something usable. First
> let me understand what is doable / what the costs are on your side.
>
> So we have a page cache page that you'd like to migrate to the device.
> Fine. You are willing to sacrifice direct IO - even better. We can fall
> back to buffered IO in that case (well, except for XFS, which does not do
> that, but that's a minor detail). One thing I'm not sure about: when a page
> is migrated to the device, are its contents still available (just possibly
> stale), or will something bad happen if we try to access (or even modify)
> the page data?

For the Coherent Device Memory case, the CPU can continue to access these
device pages.


>
> And by migration you really mean page migration? Be aware that migration of
> pagecache pages may be a problem on its own for some pages of some
> filesystems - e.g. page migration may fail because there is an outstanding
> filesystem transaction modifying that page. For userspace these will be
> sporadic errors that are really hard to understand, because it's really a
> filesystem-internal thing. So far page migration has been widely used only
> for free-space defragmentation, and for that purpose, if a page is not
> migratable for a minute, who cares.

On the device driver side, I guess we should be able to handle page
migration failures and retry. For the reverse direction, I guess we need the
guarantee that a CPU access can always migrate these pages back without
failure? Are there failure conditions we need to handle when migrating
pages back to system memory?
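
Something along these lines is what I have in mind on the driver side. This
is only a rough sketch: hmm_migrate_to_device() and DEV_MIGRATE_MAX_RETRIES
are made-up names, not an existing interface.

/* Retry a failed migration a few times, then give up and keep using the
 * system-memory copy. */
static int dev_migrate_range(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	int tries;
	int ret = -EAGAIN;

	for (tries = 0; tries < DEV_MIGRATE_MAX_RETRIES; tries++) {
		ret = hmm_migrate_to_device(vma, start, end);
		if (ret != -EAGAIN)
			break;
		cond_resched();	/* e.g. an fs transaction still holds the page */
	}
	return ret;
}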


>
> So won't it be easier to leave the pagecache page where it is and *copy* it
> to the device? Can the device notify us *before* it is going to modify a
> page, not just after it has modified it? Possibly if we just give it the
> page read-only so that it has to ask the CPU for write permission? If yes,
> then I believe this could work and even fs support should be doable.
>

For the coherent device memory scenario, we can live with one copy, and both
the CPU and the device can access these pages. In the CDM case the decision
to migrate is driven by the frequency of access from the device.
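
Just to illustrate what "driven by the frequency of access" could look like
(dev_access_count, CDM_MIGRATE_THRESHOLD and cdm_queue_migration() are
made-up names, purely illustrative):

/* Count device accesses and queue a migration once a threshold is crossed. */
static void cdm_note_device_access(struct cdm_page_state *ps, struct page *page)
{
	if (atomic_inc_return(&ps->dev_access_count) > CDM_MIGRATE_THRESHOLD)
		cdm_queue_migration(page);	/* move the page to device memory */
}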

-aneesh


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM ATTEND] Un-addressable device memory and block/fs implications
  2016-12-16  3:14   ` Aneesh Kumar K.V
@ 2017-01-16 12:04     ` Anshuman Khandual
  -1 siblings, 0 replies; 75+ messages in thread
From: Anshuman Khandual @ 2017-01-16 12:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Jerome Glisse, lsf-pc, linux-mm, linux-block,
	linux-fsdevel

On 12/16/2016 08:44 AM, Aneesh Kumar K.V wrote:
> Jerome Glisse <jglisse@redhat.com> writes:
> 
>> I would like to discuss un-addressable device memory in the context of
>> filesystem and block device. Specificaly how to handle write-back, read,
>> ... when a filesystem page is migrated to device memory that CPU can not
>> access.
>>
>> I intend to post a patchset leveraging the same idea as the existing
>> block bounce helper (block/bounce.c) to handle this. I believe this is
>> worth discussing during summit see how people feels about such plan and
>> if they have better ideas.
>>
>>
>> I also like to join discussions on:
>>   - Peer-to-Peer DMAs between PCIe devices
>>   - CDM coherent device memory
>>   - PMEM
>>   - overall mm discussions
> I would like to attend this discussion. I can talk about coherent device
> memory and how having HMM handle that will make it easy to have one
> interface for device drivers. For the coherent device case we definitely need
> page cache migration support.

I have been in the discussion on the mailing list about HMM since V13, which
was posted back in October. I have touched upon many points, including how it
changes ZONE_DEVICE to accommodate un-addressable device memory, the migration
capability of the currently supported ZONE_DEVICE-based persistent memory, etc.
I have also looked at HMM more closely from the perspective of whether it can
accommodate coherent device memory, which has already been discussed by others
on this thread. I too would like to attend to discuss this topic further.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [LSF/MM ATTEND] Un-addressable device memory and block/fs implications
  2017-01-16 12:04     ` Anshuman Khandual
@ 2017-01-16 23:15       ` John Hubbard
  -1 siblings, 0 replies; 75+ messages in thread
From: John Hubbard @ 2017-01-16 23:15 UTC (permalink / raw)
  To: Anshuman Khandual, Aneesh Kumar K.V, Jerome Glisse, lsf-pc,
	linux-mm, linux-block, linux-fsdevel



On 01/16/2017 04:04 AM, Anshuman Khandual wrote:
> On 12/16/2016 08:44 AM, Aneesh Kumar K.V wrote:
>> Jerome Glisse <jglisse@redhat.com> writes:
>>
>>> I would like to discuss un-addressable device memory in the context of
>>> filesystem and block device. Specificaly how to handle write-back, read,
>>> ... when a filesystem page is migrated to device memory that CPU can not
>>> access.
>>>
>>> I intend to post a patchset leveraging the same idea as the existing
>>> block bounce helper (block/bounce.c) to handle this. I believe this is
>>> worth discussing during summit see how people feels about such plan and
>>> if they have better ideas.
>>>
>>>
>>> I also like to join discussions on:
>>>   - Peer-to-Peer DMAs between PCIe devices

Yes! This is looming large, because we keep insisting on building new computers with a *lot* of GPUs
in them, and then connecting them up with NICs as well, and oddly enough, people keep trying to do
peer-to-peer between GPUs, and from GPUs to NICs, etc. :)  It feels like the linux-rdma and linux-pci
discussions in the past sort of stalled, due to not being certain of the long-term direction of the
design. So it's worth coming up with one.



>>>   - CDM coherent device memory
>>>   - PMEM
>>>   - overall mm discussions
>> I would like to attend this discussion. I can talk about coherent device
>> memory and how having HMM handle that will make it easy to have one
>> interface for device drivers. For the coherent device case we definitely need
>> page cache migration support.
>
> I have been in the discussion on the mailing list about HMM since V13 which
> got posted back in October. Touched upon many points including how it changes
> ZONE_DEVICE to accommodate un-addressable device memory, migration capability
> of currently supported ZONE_DEVICE based persistent memory etc. Looked at the
> HMM more closely from the perspective whether it can also accommodate coherent
> device memory which has been already discussed by others on this thread. I too
> would like to attend to discuss more on this topic.

Also, on the huge page points (mentioned early in this short thread): some of our GPUs could, at 
times, match the CPU's large/huge page sizes. It is a delicate thing to achieve, but moving around, 
say, 2 MB pages between CPU and GPU would be, for some workloads, really fast.

I should be able to present performance numbers for HMM on Pascal GPUs, so if anyone would like 
that, let me know in advance of any particular workloads or configurations that seem most 
interesting, and I'll gather that.

Also would like to attend this one.

thanks
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Lsf-pc] [LSF/MM ATTEND] Un-addressable device memory and block/fs implications
  2016-12-16  3:14   ` Aneesh Kumar K.V
@ 2017-01-18 11:00     ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2017-01-18 11:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel

On Fri 16-12-16 08:44:11, Aneesh Kumar K.V wrote:
> Jerome Glisse <jglisse@redhat.com> writes:
> 
> > I would like to discuss un-addressable device memory in the context of
> > filesystem and block device. Specificaly how to handle write-back, read,
> > ... when a filesystem page is migrated to device memory that CPU can not
> > access.
> >
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this is
> > worth discussing during summit see how people feels about such plan and
> > if they have better ideas.
> >
> >
> > I also like to join discussions on:
> >   - Peer-to-Peer DMAs between PCIe devices
> >   - CDM coherent device memory
> >   - PMEM
> >   - overall mm discussions
> 
> I would like to attend this discussion. I can talk about coherent device
> memory and how having HMM handle that will make it easy to have one
> interface for device drivers. For the coherent device case we definitely need
> page cache migration support.

Aneesh, did you intend this as your request to attend? You posted it as a
reply to another email so it is not really clear. Note that each attend
request should be a separate email so that it does not get lost...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread
