linux-mm.kvack.org archive mirror
* [RFC 00/15] epoll: support pollable epoll from userspace
@ 2019-01-09 16:40 Roman Penyaev
  2019-01-09 16:40 ` Roman Penyaev
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Luis R. Rodriguez, Paul E. McKenney, Al Viro,
	Andrea Parri, Andrew Morton, Andrey Ryabinin, Davidlohr Bueso,
	Jason Baron, Joe Perches, Linus Torvalds, Michal Hocko,
	linux-fsdevel, linux-mm, linux-kernel

Hi all,

This series introduces pollable epoll from userspace: the user creates an
epfd with the new EPOLL_USERPOLL flag, mmaps the epoll descriptor, obtains
the header and ring pointers, and then consumes ready events from the ring,
avoiding the epoll_wait() call.  When the ring is empty, the user has to
call epoll_wait() in order to wait for new events.  epoll_wait() returns
-ESTALE if the user ring still contains events (a kind of indication that
the user has to consume events from the user ring first; I could not
invent anything better than returning -ESTALE).
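
Roughly, the userspace setup could look like the sketch below.  Note that
this is only an illustration: the EPOLL_USERPOLL flag value and the mmap
offset convention for the two regions are my assumptions here, not a
definitive ABI (the struct user_header type is described further down):

 #include <assert.h>
 #include <sys/epoll.h>
 #include <sys/mman.h>

 #define EPOLL_USERPOLL (1 << 1)  /* assumed flag value */

 int setup_userpoll(struct user_header **hdr, unsigned int **index)
 {
     int epfd = epoll_create1(EPOLL_USERPOLL);

     if (epfd < 0)
         return -1;

     /* Header (with the items array) is assumed to be at offset 0 */
     *hdr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, 0);
     assert(*hdr != MAP_FAILED);

     /* Index ring is assumed to follow at offset header_length */
     *index = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, epfd,
                   (*hdr)->header_length);
     assert(*index != MAP_FAILED);

     return epfd;
 }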

For the user header and user ring allocation I used vmalloc_user().  I
found it much easier to reuse remap_vmalloc_range_partial() than to deal
with the page cache (as aio.c does).  What is also nice is that the
virtual address is properly aligned on SHMLBA, thus there should not be
any d-cache aliasing problems on archs with VIVT or VIPT caches.
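
For reference, the epfd ->mmap path can then be a thin wrapper, roughly as
sketched below (assuming the current 4-argument signature of
remap_vmalloc_range_partial(); the ep->user_header field name is
hypothetical, not taken from the series):

 static int ep_mmap(struct file *file, struct vm_area_struct *vma)
 {
     struct eventpoll *ep = file->private_data;
     size_t size = vma->vm_end - vma->vm_start;

     /* Map the vmalloc_user() area into the calling process */
     return remap_vmalloc_range_partial(vma, vma->vm_start,
                                        ep->user_header, size);
 }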

I also needed vrealloc(), which can hide all the "alloc new area - get
pages - map pages" work, so vrealloc() is introduced in the first 3 patches.

** Limitations
    
1. The EPOLLET flag is always expected for new epoll items (edge-triggered
     behavior); obviously we can't call vfs_poll() from userspace to get
     level-triggered behaviour.
    
2. No support for EPOLLWAKEUP
     Events are consumed from userspace, thus there is no way to call
     __pm_relax().
    
3. No support for EPOLLEXCLUSIVE
     If a device does not pass pollflags to wake_up() there is no way to
     call poll() from the wake-up context under spinlock, so a special
     work is scheduled to offload the polling.  In this specific case we
     can't support exclusive wakeups, because we do not know the actual
     result of the scheduled work and have to wake up every waiter.
    
4. No support for nesting of epoll descriptors polled from userspace
     There is no real good reason to scan ready events of a user ring
     from the kernel, so we just do not do that.


** Principle of operation

* Basic structures shared with userspace:

In order to consume events from userspace, all inserted items are stored
in an items array; each item holds the original epoll_event field and a
u32 field for keeping ready events, i.e. each item has the following
struct:

 struct user_epitem {
    unsigned int ready_events;
    struct epoll_event event;
 };
 BUILD_BUG_ON(sizeof(struct user_epitem) != 16);

And the following is the header, which is seen by userspace:

 struct user_header {
    unsigned int magic;          /* epoll user header magic */
    unsigned int state;          /* epoll ring state */
    unsigned int header_length;  /* length of the header + items */
    unsigned int index_length;   /* length of the index ring */
    unsigned int max_items_nr;   /* max num of item slots */
    unsigned int max_index_nr;   /* max num of item indices, always pow2 */
    unsigned int head;           /* updated by userland */
    unsigned int tail;           /* updated by kernel */
    unsigned int padding[24];    /* Header size is 128 bytes */

    struct user_epitem items[];
 };

 /* Header is 128 bytes, thus items are aligned on a CPU cache line */
 BUILD_BUG_ON(sizeof(struct user_header) != 128);

From the very beginning the kernel allocates 1 page for the user header,
i.e. by default we have 248 items for a 4096-byte page.

When the 249th item is inserted, a special expansion has to be done, which
is discussed later.

Ready events are kept in a ring buffer, which is simply an index table,
where each element points to an item in the header:

 unsigned int *user_index;

The kernel also allocates 1 page for the user index, i.e. for a 4096-byte
page we have a capacity of 1024 ring elements.
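
The capacity numbers above follow directly from the structure sizes and
can be checked with a trivial userspace program:

 #include <stdio.h>

 enum {
     PAGE_LEN   = 4096,
     HEADER_LEN = 128, /* sizeof(struct user_header) without items[] */
     ITEM_LEN   = 16,  /* sizeof(struct user_epitem) */
 };

 int main(void)
 {
     printf("max items:  %d\n", (PAGE_LEN - HEADER_LEN) / ITEM_LEN); /* 248 */
     printf("ring slots: %d\n", PAGE_LEN / 4);                       /* 1024 */
     return 0;
 }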


* How is a new event accounted on the kernel side?  How is it consumed
* from userspace?

When a new event comes for some epoll item, the kernel does the following:

 struct user_epitem *uitem;

 /* Each item has a bit (index in user items array), discussed later */
 uitem = &user_header->items[epi->bit];

 if (!atomic_fetch_or(pollflags, &uitem->ready_events)) {
     i = atomic_fetch_add(1, &ep->user_header->tail);

     item_idx = &user_index[i & index_mask];

     /* Signal with the bit, user spins on index expecting value > 0 */
     *item_idx = epi->bit + 1;

     /*
      * We want the index update to be flushed from the CPU write buffer
      * and immediately visible on the userspace side to avoid long busy
      * loops.
      */
     smp_wmb();
 }

The important thing here is that the ring can't grow indefinitely and
corrupt other elements, because the kernel always checks that the item was
marked as ready before taking a new ring slot, so userspace has to clear
the ready_events field.

On the user side the following code should be used in order to consume
events:

 tail = READ_ONCE(header->tail);
 for (; header->head != tail; header->head++) {
     item_idx_ptr = &index[header->head & index_mask];

     /*
      * Spin here till we see a valid index
      */
     while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
         ;

     item = &header->items[idx - 1];

     /*
      * Mark the index as invalid.  That is for userspace only; the kernel
      * does not care and will refill this slot only when it observes that
      * the event has been cleared, which happens below.
      */
     *item_idx_ptr = 0;

     /*
      * Fetch the data first; if the event has been cleared by the kernel
      * in the meantime, the data is dropped.
      */
     event->data = item->event.data;
     event->events = __atomic_exchange_n(&item->ready_events, 0,
                         __ATOMIC_RELEASE);
 }


* How does a new epoll item get its index inside the user items array?

The kernel keeps a bitmap for that and takes a free bit on each attempt to
insert a new epoll item.  When the bitmap is full, it is expanded; a sketch
of the allocation follows below.
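
A sketch of that bit allocation, using the standard kernel bitmap API (the
items_bm and max_items_nr field names are hypothetical, not taken from the
series):

 static int ep_get_free_bit(struct eventpoll *ep)
 {
     unsigned int bit;

     bit = find_first_zero_bit(ep->items_bm, ep->max_items_nr);
     if (bit >= ep->max_items_nr)
         return -ENOSPC; /* bitmap is full: expand and retry */
     set_bit(bit, ep->items_bm);

     return bit;
 }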

* What happens when the user items or user index has to be expanded or shrunk?

For those quite rare cases the kernel has to ask userspace to invoke
epoll_wait() in order to reallocate all user pointers under locks, i.e. for
that particular period all events are routed to kernel lists instead of the
user ring, and the kernel sets a special INACTIVE state in the user header
in order to notify the user that new events won't appear in the ring until
the user calls epoll_wait().  Worth mentioning that expansion is done
directly inside ep_insert(), because expansion is an allocation of a new
page and recreation of the virtual area on the kernel side, which does not
affect mappings on the user side.

* How does userspace detect that the kernel has expanded or shrunk the memory?

Any of the item ctl operations (add, mod, del) can be executed in parallel
with event consumption from the user ring.

Expansion is safe from the user's perspective (new pages are mapped on the
kernel side, but the user does not know or care about that), so expansion
happens directly in epoll_ctl(EPOLL_CTL_ADD), while the kernel routes all
new events to kernel lists and asks the user to call epoll_wait() via the
special INACTIVE state.

Shrinking is a bit different.  When epoll_ctl(EPOLL_CTL_DEL) is called and
the kernel decides to shrink the memory, it routes new events to kernel
lists, marks the user header state as INACTIVE and does not put the item
bit immediately, but postpones that until the user calls epoll_wait()
(which should happen soon, because user_header->state is INACTIVE and the
user should come to the kernel to sleep).  So shrinking happens only on the
epoll_wait() call, with all necessary locks taken.

The bit put has to be postponed because the user could observe a corrupted
event item if events were not yet consumed from the ring while the bit was
put and then immediately reused by a concurrent item insert.  To avoid this
possible race, the bit put is postponed until the header state is INACTIVE
and all events are routed to kernel lists.

So, returning to the question: how does userspace detect that the kernel
has changed the memory?  The user has to cache the lengths before
epoll_wait(), compare the old cached values with the new ones from the
header, and call mremap() if the values differ:

 header_length = header->header_length;
 index_length = header->index_length;

 rc = epoll_wait(epfd, NULL, 0, -1);
 assert(rc < 0);
 if (errno != ESTALE)
     return -errno;

 if (header_length != header->header_length) {
    header = mremap(header, header_length, header->header_length, MREMAP_MAYMOVE);
    assert(header != MAP_FAILED);
 }
 if (index_length != header->index_length) {
    index = mremap(index, index_length, header->index_length, MREMAP_MAYMOVE);
    assert(index != MAP_FAILED);
 }

* Is it possible to consume events from many threads on the userspace side?

That should be possible in a lockless manner, and the kernel keeps an extra
number of free slots in the ring (EPOLL_USER_EXTRA_INDEX_NR = 16) in order
to let the user consume events from up to 16 threads in parallel; see the
sketch below.
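
A sketch (not code from the series) of how several consumer threads could
claim ring slots locklessly: each thread atomically advances head with a
CAS and then exclusively owns the slot it claimed (consume_one() is a
hypothetical helper):

 unsigned int head, tail;

 tail = __atomic_load_n(&header->tail, __ATOMIC_ACQUIRE);
 head = __atomic_load_n(&header->head, __ATOMIC_RELAXED);

 while (head != tail) {
     if (__atomic_compare_exchange_n(&header->head, &head, head + 1,
                                     0 /* strong */, __ATOMIC_ACQ_REL,
                                     __ATOMIC_RELAXED)) {
         /* This thread now exclusively owns ring slot 'head' */
         consume_one(&index[head & index_mask]);
         head = __atomic_load_n(&header->head, __ATOMIC_RELAXED);
     }
     /* On CAS failure 'head' is reloaded with the current value */
 }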

This seems like a good feature from a performance point of view, but I
could not decide whether it is enough to report this value in the user
header or to let the user change it somehow on the epoll_create1() call
(or a new one?).

* Is there any testing app available?

There is a small app [1] which starts many threads with many eventfds and
produces many events, while a single consumer fetches them from userspace
and goes to the kernel from time to time in order to wait.


This is an RFC because for memory allocation I used vmalloc(), whose
kernel virtual space seems limited on some archs.  For example, for 1
million items the kernel has to allocate 10^6 x 16 bytes [items] +
10^6 x 4 bytes [index], which is around ~20 MB.  That seems very small,
but I am not sure whether it is OK or not.

I temporarily used gcc atomic builtins on the kernel side, because I did
not find any good way to atomically update a plain unsigned int of the
user_header structure without casting it to atomic_t.  Or is casting fine
in that case?

There are not enough good, informative comments in the code explaining all
the machinery.  The hardest part is still ahead, I would say.

Only very basic scenarios are tested; all the things with user
reallocations (expansion, shrinking) are not tested at all.

[1] https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c

Roman Penyaev (15):
  mm/vmalloc: add new 'alignment' field for vm_struct structure
  mm/vmalloc: move common logic from  __vmalloc_area_node to a separate
    func
  mm/vmalloc: introduce new vrealloc() call and its subsidiary reach
    analog
  epoll: move private helpers from a header to the source
  epoll: introduce user header structure and user index for polling from
    userspace
  epoll: introduce various of helpers for user structure lengths
    calculations
  epoll: extend epitem struct with new members for polling from
    userspace
  epoll: some sanity flags checks for epoll syscalls for polled epfd
    from userspace
  epoll: introduce stand-alone helpers for polling from userspace
  epoll: support polling from userspace for ep_insert()
  epoll: offload polling to a work in case of epfd polled from userspace
  epoll: support polling from userspace for ep_remove()
  epoll: support polling from userspace for ep_modify()
  epoll: support polling from userspace for ep_poll()
  epoll: support mapping for epfd when polled from userspace

 fs/eventpoll.c                 | 1042 +++++++++++++++++++++++++++++---
 include/linux/vmalloc.h        |    4 +
 include/uapi/linux/eventpoll.h |   15 +-
 mm/vmalloc.c                   |  152 ++++-
 4 files changed, 1117 insertions(+), 96 deletions(-)

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
-- 
2.19.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
  2019-01-09 16:40 ` Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40   ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 02/15] mm/vmalloc: move common logic from __vmalloc_area_node to a separate func Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
  3 siblings, 1 reply; 10+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Michal Hocko, Andrey Ryabinin,
	Joe Perches, Luis R. Rodriguez, linux-mm, linux-kernel

I need a new alignment field for the vm area in order to reallocate a
previously allocated area with the same alignment.

A patch for the new vrealloc() call will follow, and I want to keep that
new call as simple as possible, thus not providing dozens of variants,
like vrealloc_user(), which would have to care about alignment.

The current changes are just preparations.

Worth mentioning that on archs where unsigned long is 64 bit this new
field does not bloat vm_struct, because originally there was a padding
hole between nr_pages and phys_addr.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 10 ++++++----
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..78210aa0bb43 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -38,6 +38,7 @@ struct vm_struct {
 	unsigned long		flags;
 	struct page		**pages;
 	unsigned int		nr_pages;
+	unsigned int		alignment;
 	phys_addr_t		phys_addr;
 	const void		*caller;
 };
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e83961767dc1..4851b4a67f55 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1347,12 +1347,14 @@ int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
 EXPORT_SYMBOL_GPL(map_vm_area);
 
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
-			      unsigned long flags, const void *caller)
+			     unsigned int align, unsigned long flags,
+			     const void *caller)
 {
 	spin_lock(&vmap_area_lock);
 	vm->flags = flags;
 	vm->addr = (void *)va->va_start;
 	vm->size = va->va_end - va->va_start;
+	vm->alignment = align;
 	vm->caller = caller;
 	va->vm = vm;
 	va->flags |= VM_VM_AREA;
@@ -1399,7 +1401,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 		return NULL;
 	}
 
-	setup_vmalloc_vm(area, va, flags, caller);
+	setup_vmalloc_vm(area, va, align, flags, caller);
 
 	return area;
 }
@@ -2601,8 +2603,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 	/* insert all vm's */
 	for (area = 0; area < nr_vms; area++)
-		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
-				 pcpu_get_vm_areas);
+		setup_vmalloc_vm(vms[area], vas[area], align,
+				 VM_ALLOC, pcpu_get_vm_areas);
 
 	kfree(vas);
 	return vms;
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 02/15] mm/vmalloc: move common logic from  __vmalloc_area_node to a separate func
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
  2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40   ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
  3 siblings, 1 reply; 10+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Michal Hocko, Andrey Ryabinin,
	Joe Perches, Luis R. Rodriguez, linux-mm, linux-kernel

This one moves the logic related to the pages array creation to a separate
function, which will also be used by the vrealloc() call, whose
implementation will follow.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmalloc.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4851b4a67f55..ad6cd807f6db 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1662,21 +1662,26 @@ EXPORT_SYMBOL(vmap);
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
 			    int node, const void *caller);
-static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
-				 pgprot_t prot, int node)
+
+static int alloc_vm_area_array(struct vm_struct *area, gfp_t gfp_mask, int node)
 {
+	unsigned int nr_pages, array_size;
 	struct page **pages;
-	unsigned int nr_pages, array_size, i;
+
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
-	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
 	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
 					0 :
 					__GFP_HIGHMEM;
 
+	if (WARN_ON(area->pages))
+		return -EINVAL;
+
 	nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
+	if (!nr_pages)
+		return -EINVAL;
+
 	array_size = (nr_pages * sizeof(struct page *));
 
-	area->nr_pages = nr_pages;
 	/* Please note that the recursion is strictly bounded. */
 	if (array_size > PAGE_SIZE) {
 		pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
@@ -1684,8 +1689,25 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	} else {
 		pages = kmalloc_node(array_size, nested_gfp, node);
 	}
+	if (!pages)
+		return -ENOMEM;
+
+	area->nr_pages = nr_pages;
 	area->pages = pages;
-	if (!area->pages) {
+
+	return 0;
+}
+
+static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
+				 pgprot_t prot, int node)
+{
+	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
+	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
+					0 :
+					__GFP_HIGHMEM;
+	unsigned int i;
+
+	if (alloc_vm_area_array(area, gfp_mask, node)) {
 		remove_vm_area(area->addr);
 		kfree(area);
 		return NULL;
@@ -1709,7 +1731,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			cond_resched();
 	}
 
-	if (map_vm_area(area, prot, pages))
+	if (map_vm_area(area, prot, area->pages))
 		goto fail;
 	return area->addr;
 
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (2 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 02/15] mm/vmalloc: move common logic from __vmalloc_area_node to a separate func Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40   ` Roman Penyaev
  2019-01-09 16:50   ` Matthew Wilcox
  3 siblings, 2 replies; 10+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Michal Hocko, Andrey Ryabinin,
	Joe Perches, Luis R. Rodriguez, linux-mm, linux-kernel

This function changes the size of virtually contiguous memory previously
allocated by vmalloc().

vrealloc() under the hood does the following:

 1. allocates a new vm area based on the alignment of the old one.
 2. allocates a pages array for the new vm area.
 3. fills in the ->pages array, taking pages from the old area and
    increasing their page refs.

    In case the virtual size grows (old_size < new_size), new pages
    for the new area are allocated using the gfp passed by the caller.

Basically vrealloc() repeats glibc realloc() with only one big difference:
the old area is not freed, i.e. the caller is responsible for calling
vfree() in case of a successful reallocation.

Why isn't vfree() called for the old area directly from vrealloc()?
Because sometimes it is better to have a transaction-like reallocation of
several pointers and to reallocate all at once, i.e.:

  new_p1 = vrealloc(p1, new_len);
  new_p2 = vrealloc(p2, new_len);
  if (!new_p1 || !new_p2) {
	vfree(new_p1);
	vfree(new_p2);
	return -ENOMEM;
  }

  vfree(p1);
  vfree(p2);

  p1 = new_p1;
  p2 = new_p2;

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/vmalloc.h |   3 ++
 mm/vmalloc.c            | 106 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 78210aa0bb43..2902faf26c4f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -72,6 +72,7 @@ static inline void vmalloc_init(void)
 
 extern void *vmalloc(unsigned long size);
 extern void *vzalloc(unsigned long size);
+extern void *vrealloc(void *old_addr, unsigned long size);
 extern void *vmalloc_user(unsigned long size);
 extern void *vmalloc_node(unsigned long size, int node);
 extern void *vzalloc_node(unsigned long size, int node);
@@ -83,6 +84,8 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller);
+extern void *__vrealloc_node(void *old_addr, unsigned long size, gfp_t gfp_mask,
+			     pgprot_t prot, int node, const void *caller);
 #ifndef CONFIG_MMU
 extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ad6cd807f6db..94cc99e780c7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1889,6 +1889,112 @@ void *vzalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vzalloc);
 
+void *__vrealloc_node(void *old_addr, unsigned long size, gfp_t gfp_mask,
+		      pgprot_t prot, int node, const void *caller)
+{
+	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
+	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?	0 :
+					__GFP_HIGHMEM;
+	struct vm_struct *old_area, *area;
+	struct page *page;
+
+	unsigned int i;
+
+	old_area = find_vm_area(old_addr);
+	if (!old_area)
+		return NULL;
+
+	if (!(old_area->flags & VM_ALLOC))
+		return NULL;
+
+	size = PAGE_ALIGN(size);
+	if (!size || (size >> PAGE_SHIFT) > totalram_pages())
+		return NULL;
+
+	if (get_vm_area_size(old_area) == size)
+		return old_addr;
+
+	area = __get_vm_area_node(size, old_area->alignment, VM_UNINITIALIZED |
+				  old_area->flags, VMALLOC_START, VMALLOC_END,
+				  node, gfp_mask, caller);
+	if (!area)
+		return NULL;
+
+	if (alloc_vm_area_array(area, gfp_mask, node)) {
+		__vunmap(area->addr, 0);
+		return NULL;
+	}
+
+	for (i = 0; i < area->nr_pages; i++) {
+		if (i < old_area->nr_pages) {
+			/* Take a page from old area and increase a ref */
+
+			page = old_area->pages[i];
+			area->pages[i] = page;
+			get_page(page);
+		} else {
+			/* Allocate more pages in case of grow */
+
+			page = alloc_page(alloc_mask|highmem_mask);
+			if (unlikely(!page)) {
+				/*
+				 * Successfully allocated i pages, free
+				 * them in __vunmap()
+				 */
+				area->nr_pages = i;
+				goto fail;
+			}
+
+			area->pages[i] = page;
+			if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
+				cond_resched();
+		}
+	}
+	if (map_vm_area(area, prot, area->pages))
+		goto fail;
+
+	/* New area is fully ready */
+	clear_vm_uninitialized_flag(area);
+	kmemleak_vmalloc(area, size, gfp_mask);
+
+	return area->addr;
+
+fail:
+	warn_alloc(gfp_mask, NULL, "vrealloc: allocation failure");
+	__vfree(area->addr);
+
+	return NULL;
+}
+EXPORT_SYMBOL(__vrealloc_node);
+
+/**
+ *	vrealloc - reallocate virtually contiguous memory with zero fill
+ *	@old_addr:	old virtual address
+ *	@size:		new size
+ *
+ *	Allocate additional pages to cover new @size from the page level
+ *	allocator if memory grows. Then pages are mapped into a new
+ *	contiguous kernel virtual space, previous area is NOT freed.
+ *
+ *	Do not forget to call vfree() passing old address.  But careful,
+ *	calling vfree() from interrupt will cause vfree_deferred() call,
+ *	which in its turn uses freed address as a temporal pointer for a
+ *	llist element, i.e. memory will be corrupted.
+ *
+ *	If new size is equal to the old size - old pointer is returned.
+ *	I.e. appropriate check should be made before calling vfree().
+ *
+ *	For tight control over page level allocator and protection flags
+ *	use __vrealloc_node() instead.
+ */
+void *vrealloc(void *old_addr, unsigned long size)
+{
+	return __vrealloc_node(old_addr, size, GFP_KERNEL | __GFP_ZERO,
+			       PAGE_KERNEL, NUMA_NO_NODE,
+			       __builtin_return_address(0));
+}
+EXPORT_SYMBOL(vrealloc);
+
 /**
  * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
  * @size: allocation size
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog
  2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
  2019-01-09 16:40   ` Roman Penyaev
@ 2019-01-09 16:50   ` Matthew Wilcox
  2019-01-10 10:08     ` Roman Penyaev
  1 sibling, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2019-01-09 16:50 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Andrew Morton, Michal Hocko, Andrey Ryabinin, Joe Perches,
	Luis R. Rodriguez, linux-mm, linux-kernel

On Wed, Jan 09, 2019 at 05:40:13PM +0100, Roman Penyaev wrote:
> Basically vrealloc() repeats glibc realloc() with only one big difference:
> old area is not freed, i.e. caller is responsible for calling vfree() in
> case of successfull reallocation.

Ouch.  Don't call it the same thing when you're providing such different
semantics.  I agree with you that the new semantics are useful ones,
I just want it called something else.  Maybe vcopy()?  vclone()?

> + *	Do not forget to call vfree() passing old address.  But careful,
> + *	calling vfree() from interrupt will cause vfree_deferred() call,
> + *	which in its turn uses freed address as a temporal pointer for a

"temporary", not temporal.

> + *	llist element, i.e. memory will be corrupted.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog
  2019-01-09 16:50   ` Matthew Wilcox
@ 2019-01-10 10:08     ` Roman Penyaev
  0 siblings, 0 replies; 10+ messages in thread
From: Roman Penyaev @ 2019-01-10 10:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Michal Hocko, Andrey Ryabinin, Joe Perches,
	Luis R. Rodriguez, linux-mm, linux-kernel

On 2019-01-09 17:50, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 05:40:13PM +0100, Roman Penyaev wrote:
>> Basically vrealloc() repeats glibc realloc() with only one big 
>> difference:
>> old area is not freed, i.e. caller is responsible for calling vfree() 
>> in
>> case of successfull reallocation.
> 
> Ouch.  Don't call it the same thing when you're providing such 
> different
> semantics.  I agree with you that the new semantics are useful ones,
> I just want it called something else.  Maybe vcopy()?  vclone()?

vclone(). I like vclone().  But Linus does not like this reallocation
under the hood for epoll (where this vrealloc() should have been used),
so it seems it won't be needed at all.

> 
>> + *	Do not forget to call vfree() passing old address.  But careful,
>> + *	calling vfree() from interrupt will cause vfree_deferred() call,
>> + *	which in its turn uses freed address as a temporal pointer for a
> 
> "temporary", not temporal.

Ha! Now I got the difference.  Thanks, Matthew :)

> 
>> + *	llist element, i.e. memory will be corrupted.

--
Roman

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-01-10 10:08 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 02/15] mm/vmalloc: move common logic from __vmalloc_area_node to a separate func Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
2019-01-09 16:50   ` Matthew Wilcox
2019-01-10 10:08     ` Roman Penyaev
