* Question about memory mapping mechanism
@ 2007-03-08 13:16 Martin Drab
  2007-03-08 14:20 ` Carsten Otte
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Martin Drab @ 2007-03-08 13:16 UTC (permalink / raw)
  To: hugh; +Cc: Linux Kernel Mailing List

Hi,

I'm writing a driver for a sampling device that is constantly delivering a 
relatively high amount of data (about 16 MB/s) and I need to deliver the 
data to the user-space ASAP. To prevent data loss I create a queue of 
buffers (consisting of few pages each) which are more or less directly 
filled by the device and then mapped to the user-space via mmap().

The thing is that I'd like to prevent the kernel from swapping these pages out,
because then I may lose some data when they are not available in time
for the next round.

My original idea (that used to work in the past) was to allocate the 
buffers using __get_free_pages(), then pin the pages down by setting their 
PG_reserved bit in the page flags before using them. And then set the 
VM_RESERVED flag of the appropriate VMA when mmap() is called for these 
pages that are then mapped using the nopage() mechanism.
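
In code, the scheme is roughly the following (a stripped-down, untested
sketch with made-up names; error handling and unmapping omitted):

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/fs.h>

#define BUF_ORDER 3                     /* e.g. 8 pages per buffer */
static unsigned long buf;               /* kernel virtual address of one buffer */

static int my_alloc_buffer(void)
{
        int i;

        buf = __get_free_pages(GFP_KERNEL, BUF_ORDER);
        if (!buf)
                return -ENOMEM;
        /* the old trick: mark every constituent page reserved */
        for (i = 0; i < (1 << BUF_ORDER); i++)
                SetPageReserved(virt_to_page(buf + i * PAGE_SIZE));
        return 0;
}

/* assumes the VMA covers exactly this one buffer, starting at offset 0 */
static struct page *my_vm_nopage(struct vm_area_struct *vma,
                                 unsigned long address, int *type)
{
        unsigned long off = address - vma->vm_start;
        struct page *page = virt_to_page(buf + off);

        get_page(page);                 /* reference for the new mapping */
        if (type)
                *type = VM_FAULT_MINOR;
        return page;
}

static struct vm_operations_struct my_vm_ops = {
        .nopage = my_vm_nopage,
};

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        vma->vm_flags |= VM_RESERVED;
        vma->vm_ops = &my_vm_ops;
        return 0;
}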

But this way no longer seems to work correctly; it kind of works, but I'm
getting the following messages for each mmapped page upon the munmap() call:

--------------------------------------
[19172.939248] Bad page state in process 'dtrtest'
[19172.939249] page:ffff81000160a978 flags:0x001a000000000404 mapping:0000000000000000 mapcount:0 count:0
[19172.939251] Trying to fix it up, but a reboot is needed
[19172.939253] Backtrace:
[19172.939256]
[19172.939257] Call Trace:
[19172.939273]  [<ffffffff802adc37>] bad_page+0x57/0x90
[19172.939280]  [<ffffffff8020b92f>] free_hot_cold_page+0x7f/0x180
[19172.939287]  [<ffffffff80207a90>] unmap_vmas+0x450/0x750
[19172.939308]  [<ffffffff80212867>] unmap_region+0xb7/0x160
[19172.939318]  [<ffffffff80211918>] do_munmap+0x238/0x2f0
[19172.939325]  [<ffffffff802656c5>] __down_write_nested+0x35/0xf0
[19172.939334]  [<ffffffff80215ffd>] sys_munmap+0x4d/0x80
[19172.939341]  [<ffffffff8025f11e>] system_call+0x7e/0x83
-------------------------------

Apparently due to the PG_reserved bit being set.

So my question is: What is currently a proper way to do all this cleanly?

Thanks,
Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-08 13:16 Question about memory mapping mechanism Martin Drab
@ 2007-03-08 14:20 ` Carsten Otte
       [not found] ` <5c77e7070703080617r63cbccfevd2c6d678f16c2b03@mail.gmail.com>
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Carsten Otte @ 2007-03-08 14:20 UTC (permalink / raw)
  To: Martin Drab; +Cc: hugh, Linux Kernel Mailing List

On 3/8/07, Martin Drab <drab@kepler.fjfi.cvut.cz> wrote:
> The thing is that I'd like to prevent the kernel from swapping these pages out,
> because then I may lose some data when they are not available in time
> for the next round.

One thing you could do is grab a reference to the pages upfront. When
you stop pushing data out to the userspace, or at least when the file
is released, you need to drop that reference again. You could even do
a kmap_atomic(), which would give you a kernel space mapping. That
way, you avoid copy_to_user for that data.
I am not sure if that's the "proper way", just my $0.02 on how I would
try to solve it.
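
Roughly something like this (untested sketch, the helper names are made up):

#include <linux/mm.h>

/* grab an extra reference on each buffer page when the buffer is set up ... */
static void my_pin_buffer(struct page **pages, int npages)
{
        int i;

        for (i = 0; i < npages; i++)
                get_page(pages[i]);
}

/* ... and drop it again once you are done with the buffer */
static void my_unpin_buffer(struct page **pages, int npages)
{
        int i;

        for (i = 0; i < npages; i++)
                put_page(pages[i]);
}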

Carsten

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
       [not found] ` <5c77e7070703080617r63cbccfevd2c6d678f16c2b03@mail.gmail.com>
@ 2007-03-08 21:36   ` Martin Drab
  2007-03-09  0:07     ` Martin Drab
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Drab @ 2007-03-08 21:36 UTC (permalink / raw)
  To: Carsten Otte; +Cc: hugh, Linux Kernel Mailing List

On Thu, 8 Mar 2007, Carsten Otte wrote:

> On 3/8/07, Martin Drab <drab@kepler.fjfi.cvut.cz> wrote:
> > 
> > The thing is that I'd like to prevent the kernel from swapping these pages out,
> > because then I may lose some data when they are not available in time
> > for the next round.
> 
> One thing you could do is grab a reference to the pages upfront.

I'm not really sure what exactly you mean by "grab a reference 
upfront"?

> When you stop pushing data out to the userspace, or at least when the 
> file is released, you need to drop that reference again.

Or do you mean a reference as with get_page()? Sure, I do a get_page()
in the nopage() handler for each page before it is passed to user-space.
That's OK, there is no problem there. The problem seems to be the PG_reserved
bit being set when the pages are unmapped from userspace, i.e. when the
application calls munmap(2).

> You could even do a kmap_atomic(), which would give you a kernel space 
> mapping. That way, you avoid copy_to_user for that data.

If I understand kmap_atomic() right, then it is not really what I need in
this case. kmap() just returns a virtual address (a logical address in
this case, since the pages are not in high memory) for a page.
kmap_atomic() does the same but disables preemption first, so all
processing of the page needs to be atomic, which in this case cannot
be guaranteed.
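
As far as I understand it, the usage pattern is roughly this (a sketch with
the two-argument kmap_atomic() API and a made-up destination buffer):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

/* Copy one page of sample data out of 'page' without sleeping. */
static void copy_sample_page(struct page *page, void *dst)
{
        void *src;

        src = kmap_atomic(page, KM_USER0);      /* disables preemption */
        memcpy(dst, src, PAGE_SIZE);            /* must not sleep in here */
        kunmap_atomic(src, KM_USER0);
}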

Or do I get it wrong? I'm not really a kernel memory management guru, so
maybe I just don't get it. ;-)

But thanks anyway.

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-08 13:16 Question about memory mapping mechanism Martin Drab
  2007-03-08 14:20 ` Carsten Otte
       [not found] ` <5c77e7070703080617r63cbccfevd2c6d678f16c2b03@mail.gmail.com>
@ 2007-03-08 23:21 ` Martin Drab
  2007-03-09  0:27 ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 13+ messages in thread
From: Martin Drab @ 2007-03-08 23:21 UTC (permalink / raw)
  To: hugh; +Cc: Linux Kernel Mailing List

On Thu, 8 Mar 2007, Martin Drab wrote:

> Hi,
> 
> I'm writing a driver for a sampling device that is constantly delivering a 
> relatively high amount of data (about 16 MB/s) and I need to deliver the 
> data to the user-space ASAP. To prevent data loss I create a queue of 
> buffers (consisting of few pages each) which are more or less directly 
> filled by the device and then mapped to the user-space via mmap().
> 
> The thing is that I'd like to prevent the kernel from swapping these pages out,
> because then I may lose some data when they are not available in time
> for the next round.
> 
> My original idea (that used to work in the past) was to allocate the 
> buffers using __get_free_pages(), then pin the pages down by setting their 
> PG_reserved bit in the page flags before using them. And then set the 
> VM_RESERVED flag of the appropriate VMA when mmap() is called for these 
> pages that are then mapped using the nopage() mechanism.
> 
> But this way no longer seems to work correctly; it kind of works, but I'm
> getting the following messages for each mmapped page upon the munmap() call:
> 
> --------------------------------------
> [19172.939248] Bad page state in process 'dtrtest'
> [19172.939249] page:ffff81000160a978 flags:0x001a000000000404 mapping:0000000000000000 mapcount:0 count:0
> [19172.939251] Trying to fix it up, but a reboot is needed
> [19172.939253] Backtrace:
> [19172.939256]
> [19172.939257] Call Trace:
> [19172.939273]  [<ffffffff802adc37>] bad_page+0x57/0x90
> [19172.939280]  [<ffffffff8020b92f>] free_hot_cold_page+0x7f/0x180
> [19172.939287]  [<ffffffff80207a90>] unmap_vmas+0x450/0x750
> [19172.939308]  [<ffffffff80212867>] unmap_region+0xb7/0x160
> [19172.939318]  [<ffffffff80211918>] do_munmap+0x238/0x2f0
> [19172.939325]  [<ffffffff802656c5>] __down_write_nested+0x35/0xf0
> [19172.939334]  [<ffffffff80215ffd>] sys_munmap+0x4d/0x80
> [19172.939341]  [<ffffffff8025f11e>] system_call+0x7e/0x83
> -------------------------------
> 
> Apparently due to the PG_reserved bit being set.
> 
> So my question is: What is currently a proper way to do all this cleanly?

Ah, OK, so as I found here:

http://www-gatago.com/linux/kernel/14660780.html

I do not need to worry about pages that exist only in the kernel being swapped
out, and thus need not set PG_reserved. Good.

On the other hand, if I understand it correctly, the buffer may
potentially end up on the LRU if it is not used for a long time (which may
happen, since there may be buffers that are used just from time to time, to
compensate for the application temporarily being unable to process the data).
Or is that not the case?

How does a page get onto an LRU list? Is there a way to prevent that at
all costs? Or do I just need to hope that all the buffers will be used
every now and then to prevent them from getting there?

I thought I would store the free, waiting-to-be-used buffers in a LIFO
stack rather than a classical FIFO ring queue, to restrict the accesses to
as little memory as possible (for better caching, perhaps?) and also
perhaps to be able to keep some statistics about the usage of the
buffers on the buffer stack, so that if some buffers are not going to
be used for a long time (meaning that the application can process the data
quickly without big problems), some buffers may be released to free the
memory (there may potentially be a huge number of buffers consuming a
lot of memory to prevent losing any of the constantly incoming data if
the application is temporarily unable to process it for some reason).

That means that the buffers at the bottom of the stack can sit there
unused for quite a while. Does that mean that they can potentially be
placed on the LRU list automatically? Because if they were swapped out
just when they were really needed, that would be a problem.

And about the user-space VMA mapping: do I need to set VM_RESERVED on
the VMA when mmapping? I suppose I do. Or do I?

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-08 21:36   ` Martin Drab
@ 2007-03-09  0:07     ` Martin Drab
  2007-03-09  1:31       ` Martin Drab
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Drab @ 2007-03-09  0:07 UTC (permalink / raw)
  To: Carsten Otte; +Cc: hugh, Linux Kernel Mailing List

On Thu, 8 Mar 2007, Martin Drab wrote:

> On Thu, 8 Mar 2007, Carsten Otte wrote:
> 
> > On 3/8/07, Martin Drab <drab@kepler.fjfi.cvut.cz> wrote:
> > > 
> > > The thing is that I'd like to prevent the kernel from swapping these pages out,
> > > because then I may lose some data when they are not available in time
> > > for the next round.
> > 
> > One thing you could do is grab a reference to the pages upfront.
> 
> I'm not really sure what exactly you mean by "grab a reference 
> upfront"?

I seem to get it now. So instead of setting PG_reserved upon allocation of
the buffer pages, I should increase the reference count of the pages by calling
get_page() on them, and that would prevent the pages from getting onto the LRU
list and thus from being swapped out. Is that it?

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-08 13:16 Question about memory mapping mechanism Martin Drab
                   ` (2 preceding siblings ...)
  2007-03-08 23:21 ` Martin Drab
@ 2007-03-09  0:27 ` Jeremy Fitzhardinge
  2007-03-09  0:38   ` Martin Drab
  3 siblings, 1 reply; 13+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09  0:27 UTC (permalink / raw)
  To: Martin Drab; +Cc: hugh, Linux Kernel Mailing List, Bryan O'Sullivan

Martin Drab wrote:
> Hi,
>
> I'm writing a driver for a sampling device that is constantly delivering a 
> relatively high amount of data (about 16 MB/s) and I need to deliver the 
> data to the user-space ASAP. To prevent data loss I create a queue of 
> buffers (consisting of few pages each) which are more or less directly 
> filled by the device and then mapped to the user-space via mmap().
>
> The thing is that I'd like to prevent the kernel from swapping these pages out,
> because then I may lose some data when they are not available in time
> for the next round.
>
> My original idea (that used to work in the past) was to allocate the 
> buffers using __get_free_pages(), then pin the pages down by setting their 
> PG_reserved bit in the page flags before using them. And then set the 
> VM_RESERVED flag of the appropriate VMA when mmap() is called for these 
> pages that are then mapped using the nopage() mechanism.
>
> But this way no longer seems to work correctly; it kind of works, but I'm
> getting the following messages for each mmapped page upon the munmap() call:
>
> --------------------------------------
> [19172.939248] Bad page state in process 'dtrtest'
> [19172.939249] page:ffff81000160a978 flags:0x001a000000000404 mapping:0000000000000000 mapcount:0 count:0
> [19172.939251] Trying to fix it up, but a reboot is needed
> [19172.939253] Backtrace:
> [19172.939256]
> [19172.939257] Call Trace:
> [19172.939273]  [<ffffffff802adc37>] bad_page+0x57/0x90
> [19172.939280]  [<ffffffff8020b92f>] free_hot_cold_page+0x7f/0x180
> [19172.939287]  [<ffffffff80207a90>] unmap_vmas+0x450/0x750
> [19172.939308]  [<ffffffff80212867>] unmap_region+0xb7/0x160
> [19172.939318]  [<ffffffff80211918>] do_munmap+0x238/0x2f0
> [19172.939325]  [<ffffffff802656c5>] __down_write_nested+0x35/0xf0
> [19172.939334]  [<ffffffff80215ffd>] sys_munmap+0x4d/0x80
> [19172.939341]  [<ffffffff8025f11e>] system_call+0x7e/0x83
> -------------------------------
>
> Apparently due to the PG_reserved bit being set.
>
> So my question is: What is currently a proper way to do all this cleanly?
>   


Have you looked at the Infiniband stuff?  I know the folks working on
the ipath driver eventually got this kind of thing working in a sane way.

    J

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-09  0:27 ` Jeremy Fitzhardinge
@ 2007-03-09  0:38   ` Martin Drab
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Drab @ 2007-03-09  0:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: hugh, Linux Kernel Mailing List, Bryan O'Sullivan

On Thu, 8 Mar 2007, Jeremy Fitzhardinge wrote:

> Martin Drab wrote:
> > Hi,
> >
> > I'm writing a driver for a sampling device that is constantly delivering a 
> > relatively high amount of data (about 16 MB/s) and I need to deliver the 
> > data to the user-space ASAP. To prevent data loss I create a queue of 
> > buffers (consisting of few pages each) which are more or less directly 
> > filled by the device and then mapped to the user-space via mmap().
> >
> > The thing is that I'd like to prevent the kernel from swapping these pages out,
> > because then I may lose some data when they are not available in time
> > for the next round.
> >
> > My original idea (that used to work in the past) was to allocate the 
> > buffers using __get_free_pages(), then pin the pages down by setting their 
> > PG_reserved bit in the page flags before using them. And then set the 
> > VM_RESERVED flag of the appropriate VMA when mmap() is called for these 
> > pages that are then mapped using the nopage() mechanism.
> >
> > But this way no longer seems to work correctly; it kind of works, but I'm
> > getting the following messages for each mmapped page upon the munmap() call:
> >
> > --------------------------------------
> > [19172.939248] Bad page state in process 'dtrtest'
> > [19172.939249] page:ffff81000160a978 flags:0x001a000000000404 mapping:0000000000000000 mapcount:0 count:0
> > [19172.939251] Trying to fix it up, but a reboot is needed
> > [19172.939253] Backtrace:
> > [19172.939256]
> > [19172.939257] Call Trace:
> > [19172.939273]  [<ffffffff802adc37>] bad_page+0x57/0x90
> > [19172.939280]  [<ffffffff8020b92f>] free_hot_cold_page+0x7f/0x180
> > [19172.939287]  [<ffffffff80207a90>] unmap_vmas+0x450/0x750
> > [19172.939308]  [<ffffffff80212867>] unmap_region+0xb7/0x160
> > [19172.939318]  [<ffffffff80211918>] do_munmap+0x238/0x2f0
> > [19172.939325]  [<ffffffff802656c5>] __down_write_nested+0x35/0xf0
> > [19172.939334]  [<ffffffff80215ffd>] sys_munmap+0x4d/0x80
> > [19172.939341]  [<ffffffff8025f11e>] system_call+0x7e/0x83
> > -------------------------------
> >
> > Apparently due to the PG_reserved bit being set.
> >
> > So my question is: What is currently a proper way to do all this cleanly?
> 
> Have you looked at the Infiniband stuff?  I know the folks working on
> the ipath driver eventually got this kind of thing working in a sane way.

I didn't. But I will, thanks a lot for the tip.

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-09  0:07     ` Martin Drab
@ 2007-03-09  1:31       ` Martin Drab
  2007-03-13 20:49         ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Drab @ 2007-03-09  1:31 UTC (permalink / raw)
  To: Carsten Otte; +Cc: hugh, Linux Kernel Mailing List

On Fri, 9 Mar 2007, Martin Drab wrote:

> On Thu, 8 Mar 2007, Martin Drab wrote:
> 
> > On Thu, 8 Mar 2007, Carsten Otte wrote:
> > 
> > > On 3/8/07, Martin Drab <drab@kepler.fjfi.cvut.cz> wrote:
> > > > 
> > > > The thing is that I'd like to prevent the kernel from swapping these pages out,
> > > > because then I may lose some data when they are not available in time
> > > > for the next round.
> > > 
> > > One thing you could do is grab a reference to the pages upfront.
> > 
> > I'm not really sure what exactly you mean by "grab a reference 
> > upfront"?
> 
> I seem to get it now. So instead of setting PG_reserved upon allocation of
> the buffer pages, I should increase the reference count of the pages by calling
> get_page() on them, and that would prevent the pages from getting onto the LRU
> list and thus from being swapped out. Is that it?

Well, so I tried it. The truth is that the "Bad page" messages upon munmap(2)
are gone. But are the pages really prevented from being swapped out? I don't
know, and I don't know how to find out.

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-09  1:31       ` Martin Drab
@ 2007-03-13 20:49         ` Hugh Dickins
  2007-05-21 18:39           ` Martin Drab
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2007-03-13 20:49 UTC (permalink / raw)
  To: Martin Drab; +Cc: Carsten Otte, Linux Kernel Mailing List

On Fri, 9 Mar 2007, Martin Drab wrote:
> On Fri, 9 Mar 2007, Martin Drab wrote:
> > On Thu, 8 Mar 2007, Martin Drab wrote:
> > > On Thu, 8 Mar 2007, Carsten Otte wrote:
> > > > On 3/8/07, Martin Drab <drab@kepler.fjfi.cvut.cz> wrote:
> > > > > 
> > > > > The thing is that I'd like to prevent the kernel from swapping these pages out,
> > > > > because then I may lose some data when they are not available in time
> > > > > for the next round.
> > > > 
> > > > One thing you could do is grab a reference to the pages upfront.
> > > 
> > > I'm not really sure what exactly you mean by "grab a reference 
> > > upfront"?
> > 
> > I seem to get it now. So instead of setting PG_reserved upon allocation of
> > the buffer pages, I should increase the reference count of the pages by calling
> > get_page() on them, and that would prevent the pages from getting onto the LRU
> > list and thus from being swapped out. Is that it?
> 
> Well, so I tried it. The truth is that the "Bad page" messages upon munmap(2)
> are gone. But are the pages really prevented from being swapped out? I don't
> know, and I don't know how to find out.

Hi Martin, sorry for joining so late.

Most of your anxieties are unfounded: the pages your driver allocates
with __get_free_pages are not put on any LRU (until they're freed),
kernel pages are never swapped out, and making them visible to user-
space does not put them in any more danger of being swapped out;
but yes, if the refcounting goes wrong, that will cause premature
freeing, "Bad page" messages, trouble.

(Of course, filesystem pagecache pages are liable to be swapped
out, and anonymous userspace pages: but you'd have to take special
steps to go down those paths, which I'm confident you're not taking:
this code used to work, you say.)

Please disregard the suggestion to look at Infiniband, it will only
confuse you further: Infiniband core/uverbs_mem.c does provide a very
good example of how to handle the much more complex opposite of what
you're trying to do.  You have a driver making its pages visible to
userspace, Infiniband shows how a driver should deal with userspace
buffers made available to it (which does involve worrying about
those issues which were concerning you).

Your problem is that PageReserved used to work for you, and now
(post-2.6.14) it doesn't: we do now rely entirely on the refcount,
and will report "Bad page state" if a pagecount goes down to zero
while still marked as PageReserved, which is what you saw.

My guess is that you're using __get_free_pages with non-0 order?
But mapping individual PAGE_SIZE pages from that into userspace
by a nopage method?  That is liable to be a problem, yes, because
the refcount for the whole is kept in the first struct page, the
later struct pages showing refcount 0: the whole is supposed to
be dealt with all together, but userspace page accounting on exit
will treat each PAGE_SIZE separately.  PageReserved used to
override the refcount, but it no longer does so.

There's a number of different solutions to that, and fiddling with
the reference counts of the constituent pages is certainly one of
them; though better is to use split_page() (see mm/page_alloc.c),
then free the constituent pages separately at the end; (and better
is to avoid >0-order allocations since they're harder to guarantee,
but presumably you don't want to reorganize your driver right now;)
but the simplest change is to __get_free_pages with the __GFP_COMP
flag set, which marks all the constituent pages as constituents of
a compound page, and thereby keeps the refcounting right - that's
the solution we used in sound/core/memalloc.c when this problem
first came up (but it won't work pre-2.6.15).
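
To illustrate the two options (an untested sketch, not your driver's code):

#include <linux/gfp.h>
#include <linux/mm.h>

static unsigned long alloc_buffer(unsigned int order, int use_compound)
{
        unsigned long addr;

        if (use_compound)
                /* Option 1: keep the block together as one compound page. */
                return __get_free_pages(GFP_KERNEL | __GFP_COMP, order);

        /* Option 2: allocate high-order, then split it into independent
         * 0-order pages; free each constituent page separately with
         * free_page() at the end.
         */
        addr = __get_free_pages(GFP_KERNEL, order);
        if (addr)
                split_page(virt_to_page(addr), order);
        return addr;
}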

Maybe the solution you've already adopted, with additional
get_page()s, is correct: but you need to be careful when your
module is unloaded, there's a danger of leaking the memory if
you don't put_page() the first, and there's a danger of... I
forget what exactly if you don't reset the counts properly on
the subsequent constituents.  __GFP_COMP to hold the compound
page properly together, or split_page() to split it properly
apart, are preferable.

Hugh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-13 20:49         ` Hugh Dickins
@ 2007-05-21 18:39           ` Martin Drab
  2007-05-29 17:58             ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Drab @ 2007-05-21 18:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Carsten Otte, Linux Kernel Mailing List

On Tue, 13 Mar 2007, Hugh Dickins wrote:

> On Fri, 9 Mar 2007, Martin Drab wrote:
> > On Fri, 9 Mar 2007, Martin Drab wrote:
> > > On Thu, 8 Mar 2007, Martin Drab wrote:
> > > > On Thu, 8 Mar 2007, Carsten Otte wrote:
> > > > > On 3/8/07, Martin Drab <drab@kepler.fjfi.cvut.cz> wrote:
> > > > > > 
> > > > > > The thing is that I'd like to prevent the kernel from swapping these pages out,
> > > > > > because then I may lose some data when they are not available in time
> > > > > > for the next round.
> > > > > 
> > > > > One thing you could do is grab a reference to the pages upfront.
> > > > 
> > > > I'm not really sure what exactly you mean by "grab a reference 
> > > > upfront"?
> > > 
> > > I seem to get it now. So instead of setting PG_reserved upon allocation of
> > > the buffer pages, I should increase the reference count of the pages by calling
> > > get_page() on them, and that would prevent the pages from getting onto the LRU
> > > list and thus from being swapped out. Is that it?
> > 
> > Well, so I tried it. The truth is that the "Bad page" messages upon munmap(2)
> > are gone. But are the pages really prevented from being swapped out? I don't
> > know, and I don't know how to find out.
> 
> Hi Martin, sorry for joining so late.

Hi, Hugh, this time I'm sorry for responding sooo late.

> Most of your anxieties are unfounded: the pages your driver allocates
> with __get_free_pages are not put on any LRU (until they're freed),
> kernel pages are never swapped out, and making them visible to user-
> space does not put them in any more danger of being swapped out;
> but yes, if the refcounting goes wrong, that will cause premature
> freeing, "Bad page" messages, trouble.

Good, I needed to know that.

> (Of course, filesystem pagecache pages are liable to be swapped
> out, and anonymous userspace pages: but you'd have to take special
> steps to go down those paths, which I'm confident you're not taking:
> this code used to work, you say.)

Well, it wasn't this code in particular, but another driver I was putting
together a while ago which used the same technique for mmap().
What I'm putting together now is a new driver, but it does a similar
thing.

> Please disregard the suggestion to look at Infiniband, it will only
> confuse you further: Infiniband core/uverbs_mem.c does provide a very
> good example of how to handle the much more complex opposite of what
> you're trying to do.  You have a driver making its pages visible to
> userspace, Infiniband shows how a driver should deal with userspace
> buffers made available to it (which does involve worrying about
> those issues which were concerning you).

Well, yes, in this case I'm handling the direction where the kernel acquires
data into kernel-allocated pages and provides them to user-space. However,
the second part of the driver (which I haven't finished yet, but will
write in the future) will do the opposite, i.e. the application will
generate data that has to be sent to the device. And of course I'd
be glad if you could help me choose the best way to do that.

Would it in that case be better to do the same as I do for the first part,
i.e. mmap() kernel pages to user-space, let the app fill them, and then
send (queue for sending) them when they are munmap()ped? Or would it be better
to use the Infiniband approach, i.e. map the user-space pages and send them
directly? (The pages would have to be transferable by the USB DMA,
possibly without too much overhead.) Perhaps in this case the latter
would be better? Or maybe use some completely different approach?

...
> My guess is that you're using __get_free_pages with non-0 order?
> But mapping individual PAGE_SIZE pages from that into userspace
> by a nopage method?  That is liable to be a problem, yes, because
> the refcount for the whole is kept in the first struct page, the
> later struct pages showing refcount 0: the whole is supposed to
> be dealt with all together, but userspace page accounting on exit
> will treat each PAGE_SIZE separately.  PageReserved used to
> override the refcount, but it no longer does so.
> 
> There's a number of different solutions to that, and fiddling with
> the reference counts of the constituent pages is certainly one of
> them; though better is to use split_page() (see mm/page_alloc.c),
> then free the constituent pages separately at the end; (and better
> is to avoid >0-order allocations since they're harder to guarantee,
> but presumably you don't want to reorganize your driver right now;)
> but the simplest change is to __get_free_pages with the __GFP_COMP
> flag set, which marks all the constituent pages as constituents of
> a compound page, and thereby keeps the refcounting right - that's
> the solution we used in sound/core/memalloc.c when this problem
> first came up (but it won't work pre-2.6.15).

Yes, I'm using non-0 order pages. The size of the transfer buffers is
configurable, depending on the preselected size of the transfer block.
Currently the default value is 32 KB (8 pages), but it can possibly be even
bigger (because the smaller the transfer buffers, the more overhead the
system incurs).

However, meanwhile things have changed a little. I found out that using
vma_nopage() for mapping the individual pages is highly inefficient,
since it means a user-space to kernel-space switch and back for every page,
and that seems to slow the system down dramatically. Another severe
slowdown was caused by the application processing the data of only one
buffer at a time (getting info about the buffer to be mmapped to
user-space, mmapping it, and processing the buffer data).

So I had to rethink the strategy completely. The processing had to be done
not one buffer at a time, but on all (filled) buffers available at a time
(with a certain application-defined maximum, of course), so multiple buffers
of page order 3 (= 32 KB by default) have to be mmapped sequentially, one
after another in a predefined order, so that everything appears in
user-space as one contiguous block of data. Another thing is that on each
mmap() call I know exactly how much memory has to be mmapped, so I can
afford to map all the memory up front during the mmap() call.

And so I thought I'd allocate each individual buffer with
__get_free_pages() with __GFP_COMP set, to make the entire buffer
behave as one compound page, and then during the mmap() call do a
vm_insert_page() on each compound page representing a buffer I want to
mmap in that call.

The comment on vm_insert_page() in mm/memory.c says:

 * If you allocate a compound page, you need to have marked it as
 * such (__GFP_COMP), or manually just split the page up yourself
 * (see split_page()).

So I did, and I thought it would work. But it doesn't seem to. The
vm_insert_page() call passes all right, but it still seems to map only the
first page of each compound page, since when the application is
accessing the data the nopage() method is still called for all pages
but the first one of each compound page. Does that mean that I
still need to insert each individual page from within the compound page
with vm_insert_page(), or what? (If that's so, perhaps it would be nice to
mention it in a comment at vm_insert_page().)

And about reorganizing the driver not to make >0-order page
allocations: well, I would do it if you know of any other solution
that would allow a USB URB transfer into a buffer (perhaps
consisting of individually allocated 0-order pages, as you suggest) that
nevertheless definitely needs to be bigger than one page. Perhaps something
like scatter-gather would do, but I don't know of any such mechanism
for USB URBs.

The transfers need to be bigger than a page because the overhead
of doing the transfer only one page at a time would be unbearable.
It's an isochronous transfer doing exactly 3072 B every 125 microseconds.
Anything less than doing 8 such transfers in one usb_submit_urb() call
seems to incur too much overhead. If there is any other solution, I'd
gladly change the driver, but unfortunately I don't know of any.

Thanks,
Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-05-21 18:39           ` Martin Drab
@ 2007-05-29 17:58             ` Hugh Dickins
  0 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2007-05-29 17:58 UTC (permalink / raw)
  To: Martin Drab; +Cc: Carsten Otte, Linux Kernel Mailing List

On Mon, 21 May 2007, Martin Drab wrote:
> On Tue, 13 Mar 2007, Hugh Dickins wrote:
> > On Fri, 9 Mar 2007, Martin Drab wrote:
> > 
> > Hi Martin, sorry for joining so late.
> 
> Hi, Hugh, this time I'm sorry for responding sooo late.

Ditto.

And preface: I'm a lousy person to be advising on driver writing; it
often turns out that I'm making a fool of myself one way or another.
If you're very lucky, not this time!

> Well, yes, in this case I'm handling the direction where the kernel acquires
> data into kernel-allocated pages and provides them to user-space. However,
> the second part of the driver (which I haven't finished yet, but will
> write in the future) will do the opposite, i.e. the application will
> generate data that has to be sent to the device. And of course I'd
> be glad if you could help me choose the best way to do that.
> 
> Would it in that case be better to do the same as I do for the first part,
> i.e. mmap() kernel pages to user-space, let the app fill them, and then
> send (queue for sending) them when they are munmap()ped? Or would it be better
> to use the Infiniband approach, i.e. map the user-space pages and send them
> directly? (The pages would have to be transferable by the USB DMA,
> possibly without too much overhead.) Perhaps in this case the latter
> would be better? Or maybe use some completely different approach?

I'd strongly recommend that you do as you are already doing, not switch
over to the Infiniband method.  The Infiniband drivers have to satisfy
standards that demand that userspace define the buffers, and that adds
a great deal of complexity and scope for error.  Unless you have to
satisfy an external standard of that kind, let your driver define
the buffers as it already does.  (From what you say later about 32kB
extents, I can't see how you could let userspace define them anyway.)

> Yes, I'm using non-0 order pages. The size of the transfer buffers is
> configurable, depending on the preselected size of the transfer block.
> Currently the default value is 32 KB (8 pages), but it can possibly be even
> bigger (because the smaller the transfer buffers, the more overhead the
> system incurs).

32kB, order 3.  Okay, __alloc_pages will retry for those, so there's
a reasonable chance you'll be able to get them.  But be aware that
asking for non-0 order pages will always place some strain on the
memory management system, no matter what anti-fragmentation measures
are taken now or in future.  I take it you have a hardware constraint
that makes the smaller buffers significantly less efficient; if it's
just a software issue, then better to redesign to use 0-order pages.
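
(Schematically, the 0-order alternative is just an array of independently
allocated pages -- an untested sketch, not a drop-in for your driver:)

#include <linux/gfp.h>
#include <linux/mm.h>

static int alloc_page_array(struct page **pages, int npages)
{
        int i;

        for (i = 0; i < npages; i++) {
                pages[i] = alloc_page(GFP_KERNEL);
                if (!pages[i])
                        goto undo;
        }
        return 0;
undo:
        while (--i >= 0)
                __free_page(pages[i]);
        return -ENOMEM;
}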

> However, meanwhile things have changed a little. I found out that using
> vma_nopage() for mapping the individual pages is highly inefficient,
> since it means a user-space to kernel-space switch and back for every page,
> and that seems to slow the system down dramatically. Another severe
> slowdown was caused by the application processing the data of only one
> buffer at a time (getting info about the buffer to be mmapped to
> user-space, mmapping it, and processing the buffer data).

Most drivers would use the same set of pages over and over again,
not have to fault each page each time it's used.

> So I had to rethink the strategy completely. The processing had to be done
> not one buffer at a time, but on all (filled) buffers available at a time
> (with a certain application-defined maximum, of course), so multiple buffers
> of page order 3 (= 32 KB by default) have to be mmapped sequentially, one
> after another in a predefined order, so that everything appears in
> user-space as one contiguous block of data. Another thing is that on each
> mmap() call I know exactly how much memory has to be mmapped, so I can
> afford to map all the memory up front during the mmap() call.

But yes, that's fine, many drivers do it like that.

> And so I thought I'd allocate each individual buffer with
> __get_free_pages() with __GFP_COMP set, to make the entire buffer
> behave as one compound page, and then during the mmap() call do a
> vm_insert_page() on each compound page representing a buffer I want to
> mmap in that call.
> 
> The comment on vm_insert_page() in mm/memory.c says:
> 
>  * If you allocate a compound page, you need to have marked it as
>  * such (__GFP_COMP), or manually just split the page up yourself
>  * (see split_page()).
> 
> So I did, and I thought it would work. But it doesn't seem to. The
> vm_insert_page() call passes all right, but it still seems to map only the
> first page of each compound page, since when the application is
> accessing the data the nopage() method is still called for all pages
> but the first one of each compound page. Does that mean that I
> still need to insert each individual page from within the compound page
> with vm_insert_page(), or what? (If that's so, perhaps it would be nice to
> mention it in a comment at vm_insert_page().)

That's right, vm_insert_page inserts just a single pte.  We're rather
schizophrenic when we say "page", sometimes we're thinking about "a
compound page" or a "high-order page", and sometimes we're thinking
about all the component "struct page"s that make that up: confusing.
I've made a note to add such a comment there.
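
So your mmap method needs to insert every constituent struct page itself,
along these lines (an untested sketch; buffers[], nr_buffers and BUF_ORDER
are made-up stand-ins for your own bookkeeping, and checking that the VMA
has exactly the right size is left out):

#include <linux/mm.h>
#include <linux/fs.h>

#define BUF_ORDER 3
extern unsigned long buffers[];         /* kernel addresses of the buffers (made up) */
extern int nr_buffers;

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long uaddr = vma->vm_start;
        int b, i, err;

        for (b = 0; b < nr_buffers; b++) {
                struct page *head = virt_to_page(buffers[b]);

                for (i = 0; i < (1 << BUF_ORDER); i++) {
                        /* one pte per constituent page of the compound buffer */
                        err = vm_insert_page(vma, uaddr, head + i);
                        if (err)
                                return err;
                        uaddr += PAGE_SIZE;
                }
        }
        return 0;
}

(With every page pre-inserted like this, the nopage() path shouldn't be hit
for these mappings at all.)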

> And about reorganizing the driver not to make >0-order page
> allocations: well, I would do it if you know of any other solution
> that would allow a USB URB transfer into a buffer (perhaps
> consisting of individually allocated 0-order pages, as you suggest) that
> nevertheless definitely needs to be bigger than one page. Perhaps something
> like scatter-gather would do, but I don't know of any such mechanism
> for USB URBs.

Oh, I've been repeating myself, have I?  Scatter-gather is indeed
what we prefer to use, instead of demanding >0-order pages.   But
no way can I advise you on USB URBs.

Hugh

> The transfers need to be bigger than a page because the overhead
> of doing the transfer only one page at a time would be unbearable.
> It's an isochronous transfer doing exactly 3072 B every 125 microseconds.
> Anything less than doing 8 such transfers in one usb_submit_urb() call
> seems to incur too much overhead. If there is any other solution, I'd
> gladly change the driver, but unfortunately I don't know of any.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
  2007-03-09  0:35 ` Robert Hancock
@ 2007-03-09  0:52   ` Martin Drab
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Drab @ 2007-03-09  0:52 UTC (permalink / raw)
  To: Robert Hancock; +Cc: hugh, linux-kernel

On Thu, 8 Mar 2007, Robert Hancock wrote:

> Martin Drab wrote:
> > Hi,
> > 
> > I'm writing a driver for a sampling device that is constantly delivering a
> > relatively high amount of data (about 16 MB/s) and I need to deliver the
> > data to the user-space ASAP. To prevent data loss I create a queue of
> > buffers (consisting of few pages each) which are more or less directly
> > filled by the device and then mapped to the user-space via mmap().
> > 
> > The thing is that I'd like to prevent kernel to swap these pages out,
> > because then I may loose some data when they are not available in time for
> > the next round.
> 
> It would likely be easier to just mlock this buffer from the userspace
> application, rather than trying to achieve this in the driver..

Well yes, but it is the kernel that needs to have this guaranteed, and
especially for the buffers that aren't currently mmapped to user-space but
are free to be filled with data from the device. And generally we cannot
rely on the application to do the mlock(), I guess. If the application
didn't do it and some of the buffers happened to be swapped out, we might
end up with a page fault within an interrupt.

And besides, I guess the mlock() is effective only as long as the pages
are actually mapped into user-space. Once they are unmapped by munmap(2)
after the application has read the data, their VMA eventually ceases to
exist and the mlock() protection would be gone anyway. Right?

We need a permanent protection. But perhaps artificially increasing the 
page's reference count via get_page() upon allocation, instead of setting 
the PG_reserved bit, would do. (If I got it right.)

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Question about memory mapping mechanism
       [not found] <fa.8Fy6n0Le4Y2mDfeEy936/5bbJHA@ifi.uio.no>
@ 2007-03-09  0:35 ` Robert Hancock
  2007-03-09  0:52   ` Martin Drab
  0 siblings, 1 reply; 13+ messages in thread
From: Robert Hancock @ 2007-03-09  0:35 UTC (permalink / raw)
  To: Martin Drab, hugh, linux-kernel

Martin Drab wrote:
> Hi,
> 
> I'm writing a driver for a sampling device that is constantly delivering a 
> relatively high amount of data (about 16 MB/s) and I need to deliver the 
> data to the user-space ASAP. To prevent data loss I create a queue of 
> buffers (consisting of few pages each) which are more or less directly 
> filled by the device and then mapped to the user-space via mmap().
> 
> The thing is that I'd like to prevent kernel to swap these pages out, 
> because then I may loose some data when they are not available in time 
> for the next round.

It would likely be easier to just mlock this buffer from the userspace 
application, rather than trying to achieve this in the driver..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-05-29 17:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-08 13:16 Question about memory mapping mechanism Martin Drab
2007-03-08 14:20 ` Carsten Otte
     [not found] ` <5c77e7070703080617r63cbccfevd2c6d678f16c2b03@mail.gmail.com>
2007-03-08 21:36   ` Martin Drab
2007-03-09  0:07     ` Martin Drab
2007-03-09  1:31       ` Martin Drab
2007-03-13 20:49         ` Hugh Dickins
2007-05-21 18:39           ` Martin Drab
2007-05-29 17:58             ` Hugh Dickins
2007-03-08 23:21 ` Martin Drab
2007-03-09  0:27 ` Jeremy Fitzhardinge
2007-03-09  0:38   ` Martin Drab
     [not found] <fa.8Fy6n0Le4Y2mDfeEy936/5bbJHA@ifi.uio.no>
2007-03-09  0:35 ` Robert Hancock
2007-03-09  0:52   ` Martin Drab
