* [PATCH 0/3] fix a few OOM victim allocation runaways
From: Michal Hocko @ 2017-02-01  9:27 UTC
  To: Andrew Morton
  Cc: Christoph Hellwig, Tetsuo Handa, Al Viro, linux-mm, linux-fsdevel, LKML

Hi,
these three patches address simple OOM victim runaways where the OOM
victim can deplete the memory reserves completely. Tetsuo was able to
trigger the depletion in the write(2) path and I believe something
similar is possible for the read part. Vmalloc would be a bit harder
but still not impossible.

Unfortunately I do not see a better way around this issue as long as
we give OOM victims access to memory reserves without any limit. I
have tried to limit this access [1], which would at least help keep
some memory for emergency actions. Anyway, even if we limit the amount
of reserves the OOM victim can consume, it is still preferable to back
off before the accessible reserves are depleted.

Tetsuo was suggesting introducing __GFP_KILLABLE which would fail the
allocation rather than consuming the reserves. I see two problems with
this approach.
        1) in order for this flag to work as expected, all the blocking
        operations in the allocator call chain (including the direct
        reclaim) would have to be killable, and this is really
        non-trivial to achieve. Especially when we do not have any
        control over shrinkers.
        2) even if the above could be dealt with, we would still have
        to find all the places which allocate in a loop based on the
        user request. So it wouldn't be simpler than an explicit
        fatal_signal_pending check (sketched below).
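
To make the pattern concrete, this is roughly what each of the patches
below does at a user-request-driven allocation site (a generic sketch,
not one of the actual call sites):

	/* inside a loop allocating on behalf of a single user request */
	if (fatal_signal_pending(current)) {
		/* back off with a short read/write or -EINTR */
		ret = -EINTR;
		break;
	}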

Thoughts?

Michal Hocko (3):
      fs: break out of iomap_file_buffered_write on fatal signals
      mm, fs: check for fatal signals in do_generic_file_read
      vmalloc: back off when the current task is killed

 fs/dax.c     | 5 +++++
 fs/iomap.c   | 3 +++
 mm/filemap.c | 5 +++++
 mm/vmalloc.c | 5 +++++
 4 files changed, 18 insertions(+)

[1] http://lkml.kernel.org/r/20161004090009.7974-2-mhocko@kernel.org


* [PATCH 1/3] fs: break out of iomap_file_buffered_write on fatal signals
From: Michal Hocko @ 2017-02-01  9:27 UTC
  To: Andrew Morton
  Cc: Christoph Hellwig, Tetsuo Handa, Al Viro, linux-mm,
	linux-fsdevel, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo has noticed that an OOM stress test which performs large write
requests can deplete the memory reserves completely. He has tracked
this down to the following path:
	__alloc_pages_nodemask+0x436/0x4d0
	alloc_pages_current+0x97/0x1b0
	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
	? iomap_write_end+0x80/0x80             fs/iomap.c:150
	iomap_apply+0xb3/0x130                  fs/iomap.c:79
	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
	? iomap_write_end+0x80/0x80
	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
	? remove_wait_queue+0x59/0x60
	xfs_file_write_iter+0x90/0x130 [xfs]
	__vfs_write+0xe5/0x140
	vfs_write+0xc7/0x1f0
	? syscall_trace_enter+0x1d0/0x380
	SyS_write+0x58/0xc0
	do_syscall_64+0x6c/0x200
	entry_SYSCALL64_slow_path+0x25/0x25

The OOM victim has access to all memory reserves to make forward
progress towards exiting easier. But iomap_file_buffered_write and
other callers of iomap_apply loop to complete the full request. We
need to check for fatal signals and back off with a short write
instead. As iomap_apply delegates all the work down to the actors, we
have to hook into those. All actors that work with the page cache call
iomap_write_begin, so we will check for signals there. dax_iomap_actor
has to handle the situation explicitly because it copies data to
userspace directly. Other callers, like iomap_page_mkwrite, work on a
single page, or, like iomap_fiemap_actor, do not allocate memory based
on the given length.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/dax.c   | 5 +++++
 fs/iomap.c | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 413a91db9351..0e263dacf9cf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
 		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..691eada58b06 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 	BUG_ON(pos + len > iomap->offset + iomap->length);
 
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
 	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-- 
2.11.0


* [PATCH 2/3] mm, fs: check for fatal signals in do_generic_file_read
From: Michal Hocko @ 2017-02-01  9:27 UTC
  To: Andrew Morton
  Cc: Christoph Hellwig, Tetsuo Handa, Al Viro, linux-mm,
	linux-fsdevel, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

do_generic_file_read can be told to perform a large request from
userspace. If the system is under OOM and the reading task is the OOM
victim then it has access to memory reserves, and finishing the full
request can lead to full memory depletion, which is dangerous. Make
sure we go with a short read instead and allow the killed task to
terminate.
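
Note that a pending fatal signal turns into -EINTR only when nothing
has been read yet; otherwise the bytes copied so far are returned as a
short read. Roughly (a sketch of the return convention at the end of
do_generic_file_read, not a new change):

	return written ? written : error;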

Cc: stable
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/filemap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 14bddd0d7fa4..2ba46f410c7c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1797,6 +1797,11 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 
 		cond_resched();
 find_page:
+		if (fatal_signal_pending(current)) {
+			error = -EINTR;
+			goto out;
+		}
+
 		page = find_get_page(mapping, index);
 		if (!page) {
 			page_cache_sync_readahead(mapping,
-- 
2.11.0


* [PATCH 3/3] vmalloc: back off when the current task is killed
From: Michal Hocko @ 2017-02-01  9:27 UTC
  To: Andrew Morton
  Cc: Christoph Hellwig, Tetsuo Handa, Al Viro, linux-mm,
	linux-fsdevel, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__vmalloc_area_node allocates pages to cover the requested vmalloc
size. This can be a lot of memory. If the current task is killed by
the OOM killer, and thus has unlimited access to memory reserves, it
can theoretically consume all the memory. Fix this by checking for
fatal_signal_pending and backing off early. Setting area->nr_pages to
the number of pages allocated so far makes sure the failure path frees
only those pages which have actually been allocated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..011b446f8758 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1642,6 +1642,11 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	for (i = 0; i < area->nr_pages; i++) {
 		struct page *page;
 
+		if (fatal_signal_pending(current)) {
+			area->nr_pages = i;
+			goto fail;
+		}
+
 		if (node == NUMA_NO_NODE)
 			page = alloc_page(alloc_mask);
 		else
-- 
2.11.0


* Re: [PATCH 1/3] fs: break out of iomap_file_buffered_write on fatal signals
From: Christoph Hellwig @ 2017-02-01  9:28 UTC
  To: Michal Hocko
  Cc: Andrew Morton, Christoph Hellwig, Tetsuo Handa, Al Viro,
	linux-mm, linux-fsdevel, LKML, Michal Hocko

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 2/3] mm, fs: check for fatal signals in do_generic_file_read
From: Christoph Hellwig @ 2017-02-01  9:28 UTC
  To: Michal Hocko
  Cc: Andrew Morton, Christoph Hellwig, Tetsuo Handa, Al Viro,
	linux-mm, linux-fsdevel, LKML, Michal Hocko

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 3/3] vmalloc: back off when the current task is killed
From: Christoph Hellwig @ 2017-02-01  9:28 UTC
  To: Michal Hocko
  Cc: Andrew Morton, Christoph Hellwig, Tetsuo Handa, Al Viro,
	linux-mm, linux-fsdevel, LKML, Michal Hocko

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 0/3] fix a few OOM victim allocation runaways
From: Tetsuo Handa @ 2017-02-01 11:49 UTC
  To: mhocko, akpm; +Cc: hch, viro, linux-mm, linux-fsdevel, linux-kernel

Michal Hocko wrote:
> Tetsuo was suggesting introducing __GFP_KILLABLE which would fail the
> allocation rather than consuming the reserves. I see two problems with
> this approach.
>         1) in order for this flag to work as expected, all the blocking
>         operations in the allocator call chain (including the direct
>         reclaim) would have to be killable, and this is really
>         non-trivial to achieve. Especially when we do not have any
>         control over shrinkers.
>         2) even if the above could be dealt with, we would still have
>         to find all the places which allocate in a loop based on the
>         user request. So it wouldn't be simpler than an explicit
>         fatal_signal_pending check.
> 
> Thoughts?

I don't think they are problems.
I think it is OK to start __GFP_KILLABLE with a best-effort
implementation.

(1) is a sign of "direct reclaim considered harmful". As I demonstrated
in the "mm/page_alloc: Wait for oom_lock before retrying." thread, it
is trivial to consume all CPU time by letting far more threads than
available CPUs perform direct reclaim. It can also cause more IPIs in
drain_all_pages() than needed (for which we are considering a
workqueue in the "mm, page_alloc: drain per-cpu pages from workqueue
context" thread). There is no need to let all allocating threads do
direct reclaim. By offloading direct reclaim to dedicated kernel
threads (we are talking about the slowpath where allocating threads
need to wait for the reclaim operation anyway, so the overhead of
context switching should be acceptable), we will be able to manage
dependencies within the reclaim operation better (e.g. propagate the
allowable level of memory reserves to use) than we can today (e.g.
wb_workfn blocked for more than a minute doing direct reclaim, as
observed in the "Bug 192981 - page allocation stalls" thread).

----------
2017-01-28T04:05:57.064278+03:00 storage8 [154827.258547] MemAlloc: kworker/u33:3(27274) flags=0x4a08860 switches=22584 seq=158 gfp=0x26012d0(GFP_TEMPORARY|__GFP_NOWARN|__GFP_NORETRY|__GFP_NOTRACK) order=0 delay=22745 uninterruptible
2017-01-28T04:05:57.064278+03:00 storage8 [154827.258925] kworker/u33:3   D
2017-01-28T04:05:57.064420+03:00 storage8     0 27274      2 0x00000000
2017-01-28T04:05:57.064420+03:00 storage8 [154827.259062] Workqueue: writeback wb_workfn
2017-01-28T04:05:57.064442+03:00 storage8  (flush-66:80)
2017-01-28T04:05:57.064542+03:00 storage8
2017-01-28T04:05:57.064543+03:00 storage8 [154827.259195]  0000000000000000
2017-01-28T04:05:57.064597+03:00 storage8  ffff924f6fc21e00
2017-01-28T04:05:57.064606+03:00 storage8  ffff924f882b8000
2017-01-28T04:05:57.064693+03:00 storage8  ffff924796b5a4c0
2017-01-28T04:05:57.064821+03:00 storage8 [154827.259465]  ffff924f8f114240
2017-01-28T04:05:57.064815+03:00 storage8
2017-01-28T04:05:57.064815+03:00 storage8  ffffa53bc23c2ff8
2017-01-28T04:05:57.064829+03:00 storage8  ffffffffaf82f819
2017-01-28T04:05:57.064957+03:00 storage8  ffff924796b5a4c0
2017-01-28T04:05:57.065077+03:00 storage8
2017-01-28T04:05:57.065078+03:00 storage8 [154827.259724]  0000000000000002
2017-01-28T04:05:57.065154+03:00 storage8  7fffffffffffffff
2017-01-28T04:05:57.065154+03:00 storage8  ffffffffaf82fd42
2017-01-28T04:05:57.065219+03:00 storage8  ffff9259278bc858
2017-01-28T04:05:57.065338+03:00 storage8
2017-01-28T04:05:57.065460+03:00 storage8 [154827.259984] Call Trace:
2017-01-28T04:05:57.065632+03:00 storage8 [154827.260120]  [<ffffffffaf82f819>] ? __schedule+0x179/0x5c8
2017-01-28T04:05:57.065873+03:00 storage8 [154827.260257]  [<ffffffffaf82fd42>] ? schedule+0x32/0x80
2017-01-28T04:05:57.065873+03:00 storage8 [154827.260395]  [<ffffffffaf37b433>] ? xfs_reclaim_inode+0xd3/0x3d0
2017-01-28T04:05:57.066023+03:00 storage8 [154827.260537]  [<ffffffffaf82fd42>] ? schedule+0x32/0x80
2017-01-28T04:05:57.066201+03:00 storage8 [154827.260688]  [<ffffffffaf8322f5>] ? schedule_timeout+0x1a5/0x2a0
2017-01-28T04:05:57.066329+03:00 storage8 [154827.260837]  [<ffffffffaf37b433>] ? xfs_reclaim_inode+0xd3/0x3d0
2017-01-28T04:05:57.066467+03:00 storage8 [154827.260978]  [<ffffffffaf82f62d>] ? io_schedule_timeout+0x9d/0x110
2017-01-28T04:05:57.066611+03:00 storage8 [154827.261119]  [<ffffffffaf388d58>] ? xfs_iunpin_wait+0x128/0x1a0
2017-01-28T04:05:57.066830+03:00 storage8 [154827.261261]  [<ffffffffaf0d47d0>] ? wake_atomic_t_function+0x40/0x40
2017-01-28T04:05:57.066888+03:00 storage8 [154827.261404]  [<ffffffffaf37b433>] ? xfs_reclaim_inode+0xd3/0x3d0
2017-01-28T04:05:57.067025+03:00 storage8 [154827.261542]  [<ffffffffaf37b8e4>] ? xfs_reclaim_inodes_ag+0x1b4/0x2c0
2017-01-28T04:05:57.067167+03:00 storage8 [154827.261683]  [<ffffffffaf37ce31>] ? xfs_reclaim_inodes_nr+0x31/0x40
2017-01-28T04:05:57.067321+03:00 storage8 [154827.261825]  [<ffffffffaf20b030>] ? super_cache_scan+0x1a0/0x1b0
2017-01-28T04:05:57.067447+03:00 storage8 [154827.261965]  [<ffffffffaf195cc2>] ? shrink_slab+0x262/0x440
2017-01-28T04:05:57.067582+03:00 storage8 [154827.262103]  [<ffffffffaf0bd4af>] ? try_to_wake_up+0x1df/0x370
2017-01-28T04:05:57.067723+03:00 storage8 [154827.262239]  [<ffffffffaf19966f>] ? shrink_node+0xef/0x2d0
2017-01-28T04:05:57.067874+03:00 storage8 [154827.262377]  [<ffffffffaf199b44>] ? do_try_to_free_pages+0xc4/0x2e0
2017-01-28T04:05:57.068001+03:00 storage8 [154827.262518]  [<ffffffffaf19a014>] ? try_to_free_pages+0xe4/0x1c0
2017-01-28T04:05:57.068150+03:00 storage8 [154827.262657]  [<ffffffffaf18a6eb>] ? __alloc_pages_nodemask+0x78b/0xe50
2017-01-28T04:05:57.068282+03:00 storage8 [154827.262801]  [<ffffffffaf2c8873>] ? __ext4_journal_stop+0x83/0xc0
2017-01-28T04:05:57.068423+03:00 storage8 [154827.262938]  [<ffffffffaf1e27c3>] ? kmem_cache_alloc+0x113/0x1b0
2017-01-28T04:05:57.068571+03:00 storage8 [154827.263079]  [<ffffffffaf1d855a>] ? alloc_pages_current+0x9a/0x120
2017-01-28T04:05:57.068735+03:00 storage8 [154827.263216]  [<ffffffffaf1e03fb>] ? new_slab+0x39b/0x600
2017-01-28T04:05:57.068836+03:00 storage8 [154827.263354]  [<ffffffffaf42177e>] ? bio_attempt_back_merge+0x8e/0x110
2017-01-28T04:05:57.069018+03:00 storage8 [154827.263502]  [<ffffffffaf1e1a84>] ? ___slab_alloc+0x3e4/0x580
2017-01-28T04:05:57.069267+03:00 storage8 [154827.263642]  [<ffffffffaf29dffb>] ? ext4_init_io_end+0x1b/0x40
2017-01-28T04:05:57.069292+03:00 storage8 [154827.263787]  [<ffffffffaf420675>] ? generic_make_request+0x105/0x190
2017-01-28T04:05:57.069405+03:00 storage8 [154827.263926]  [<ffffffffaf29dffb>] ? ext4_init_io_end+0x1b/0x40
2017-01-28T04:05:57.069553+03:00 storage8 [154827.264065]  [<ffffffffaf20452f>] ? __slab_alloc+0xe/0x12
2017-01-28T04:05:57.069688+03:00 storage8 [154827.264203]  [<ffffffffaf1e2856>] ? kmem_cache_alloc+0x1a6/0x1b0
2017-01-28T04:05:57.069834+03:00 storage8 [154827.264342]  [<ffffffffaf29dffb>] ? ext4_init_io_end+0x1b/0x40
2017-01-28T04:05:57.069979+03:00 storage8 [154827.264482]  [<ffffffffaf29c4f8>] ? ext4_writepages+0x438/0xd80
2017-01-28T04:05:57.070106+03:00 storage8 [154827.264621]  [<ffffffffaf0d40a0>] ? __wake_up_common+0x50/0x90
2017-01-28T04:05:57.070249+03:00 storage8 [154827.264761]  [<ffffffffaf23462d>] ? __writeback_single_inode+0x3d/0x340
2017-01-28T04:05:57.070385+03:00 storage8 [154827.264903]  [<ffffffffaf235091>] ? writeback_sb_inodes+0x1e1/0x440
2017-01-28T04:05:57.070529+03:00 storage8 [154827.265040]  [<ffffffffaf23537d>] ? __writeback_inodes_wb+0x8d/0xc0
2017-01-28T04:05:57.070660+03:00 storage8 [154827.265178]  [<ffffffffaf2355e7>] ? wb_writeback+0x237/0x2c0
2017-01-28T04:05:57.070807+03:00 storage8 [154827.265317]  [<ffffffffaf235d96>] ? wb_workfn+0x1f6/0x370
2017-01-28T04:05:57.070965+03:00 storage8 [154827.265456]  [<ffffffffaf0ad2a4>] ? process_one_work+0x124/0x3b0
2017-01-28T04:05:57.071078+03:00 storage8 [154827.265594]  [<ffffffffaf0ad693>] ? worker_thread+0x123/0x470
2017-01-28T04:05:57.071219+03:00 storage8 [154827.265733]  [<ffffffffaf0ad570>] ? process_scheduled_works+0x40/0x40
2017-01-28T04:05:57.071380+03:00 storage8 [154827.265881]  [<ffffffffaf0ad570>] ? process_scheduled_works+0x40/0x40
2017-01-28T04:05:57.071526+03:00 storage8 [154827.266024]  [<ffffffffaf0b3672>] ? kthread+0xc2/0xe0
2017-01-28T04:05:57.071638+03:00 storage8 [154827.266160]  [<ffffffffaf0b35b0>] ? __kthread_init_worker+0xb0/0xb0
2017-01-28T04:05:57.071781+03:00 storage8 [154827.266300]  [<ffffffffaf833662>] ? ret_from_fork+0x22/0x30

2017-01-28T04:08:40.702099+03:00 storage8 [154990.895457] MemAlloc: kworker/u33:3(27274) flags=0x4a08860 switches=23547 seq=158 gfp=0x26012d0(GFP_TEMPORARY|__GFP_NOWARN|__GFP_NORETRY|__GFP_NOTRACK) order=0 delay=71833 uninterruptible
2017-01-28T04:08:40.702105+03:00 storage8 [154990.895835] kworker/u33:3   D
2017-01-28T04:08:40.702248+03:00 storage8     0 27274      2 0x00000000
2017-01-28T04:08:40.702256+03:00 storage8 [154990.895970] Workqueue: writeback wb_workfn
2017-01-28T04:08:40.702263+03:00 storage8  (flush-66:80)
2017-01-28T04:08:40.702360+03:00 storage8
2017-01-28T04:08:40.702360+03:00 storage8 [154990.896102]  0000000000000000
2017-01-28T04:08:40.702371+03:00 storage8  ffff924bbb3c0b40
2017-01-28T04:08:40.702371+03:00 storage8  ffff924f882ba4c0
2017-01-28T04:08:40.702497+03:00 storage8  ffff924796b5a4c0
2017-01-28T04:08:40.702639+03:00 storage8
2017-01-28T04:08:40.702647+03:00 storage8 [154990.896357]  ffff924f8f914240
2017-01-28T04:08:40.702647+03:00 storage8  ffffffffaf82f819
2017-01-28T04:08:40.702650+03:00 storage8  ffffa53bc23c2fe8
2017-01-28T04:08:40.702753+03:00 storage8  0000000000000000
2017-01-28T04:08:40.702891+03:00 storage8
2017-01-28T04:08:40.702891+03:00 storage8 [154990.896611]  ffff924666c78b60
2017-01-28T04:08:40.702904+03:00 storage8  ffff924796b5a4c0
2017-01-28T04:08:40.702904+03:00 storage8  0000000000000002
2017-01-28T04:08:40.703008+03:00 storage8  7fffffffffffffff
2017-01-28T04:08:40.703123+03:00 storage8
2017-01-28T04:08:40.703245+03:00 storage8 [154990.896866] Call Trace:
2017-01-28T04:08:40.703376+03:00 storage8 [154990.896988]  [<ffffffffaf82f819>] ? __schedule+0x179/0x5c8
2017-01-28T04:08:40.703498+03:00 storage8 [154990.897113]  [<ffffffffaf82fd42>] ? schedule+0x32/0x80
2017-01-28T04:08:40.703638+03:00 storage8 [154990.897240]  [<ffffffffaf39a70d>] ? _xfs_log_force_lsn+0x1cd/0x340
2017-01-28T04:08:40.703763+03:00 storage8 [154990.897371]  [<ffffffffaf0bd640>] ? try_to_wake_up+0x370/0x370
2017-01-28T04:08:40.703913+03:00 storage8 [154990.897509]  [<ffffffffaf388d23>] ? xfs_iunpin_wait+0xf3/0x1a0
2017-01-28T04:08:40.704489+03:00 storage8 [154990.897834]  [<ffffffffaf37b433>] ? xfs_reclaim_inode+0xd3/0x3d0
2017-01-28T04:08:40.704482+03:00 storage8 [154990.897637]  [<ffffffffaf39a8ca>] ? xfs_log_force_lsn+0x4a/0x100
2017-01-28T04:08:40.704482+03:00 storage8 [154990.897961]  [<ffffffffaf388d23>] ? xfs_iunpin_wait+0xf3/0x1a0
2017-01-28T04:08:40.704496+03:00 storage8 [154990.898089]  [<ffffffffaf0d47d0>] ? wake_atomic_t_function+0x40/0x40
2017-01-28T04:08:40.704610+03:00 storage8 [154990.898216]  [<ffffffffaf37b433>] ? xfs_reclaim_inode+0xd3/0x3d0
2017-01-28T04:08:40.704737+03:00 storage8 [154990.898345]  [<ffffffffaf37b8e4>] ? xfs_reclaim_inodes_ag+0x1b4/0x2c0
2017-01-28T04:08:40.704863+03:00 storage8 [154990.898475]  [<ffffffffaf37ce31>] ? xfs_reclaim_inodes_nr+0x31/0x40
2017-01-28T04:08:40.704993+03:00 storage8 [154990.898606]  [<ffffffffaf20b030>] ? super_cache_scan+0x1a0/0x1b0
2017-01-28T04:08:40.705127+03:00 storage8 [154990.898734]  [<ffffffffaf195cc2>] ? shrink_slab+0x262/0x440
2017-01-28T04:08:40.705249+03:00 storage8 [154990.898861]  [<ffffffffaf0bd4af>] ? try_to_wake_up+0x1df/0x370
2017-01-28T04:08:40.705403+03:00 storage8 [154990.898993]  [<ffffffffaf19966f>] ? shrink_node+0xef/0x2d0
2017-01-28T04:08:40.705524+03:00 storage8 [154990.899133]  [<ffffffffaf199b44>] ? do_try_to_free_pages+0xc4/0x2e0
2017-01-28T04:08:40.705663+03:00 storage8 [154990.899265]  [<ffffffffaf19a014>] ? try_to_free_pages+0xe4/0x1c0
2017-01-28T04:08:40.705792+03:00 storage8 [154990.899394]  [<ffffffffaf18a6eb>] ? __alloc_pages_nodemask+0x78b/0xe50
2017-01-28T04:08:40.705912+03:00 storage8 [154990.899526]  [<ffffffffaf2c8873>] ? __ext4_journal_stop+0x83/0xc0
2017-01-28T04:08:40.706061+03:00 storage8 [154990.899655]  [<ffffffffaf1e27c3>] ? kmem_cache_alloc+0x113/0x1b0
2017-01-28T04:08:40.706176+03:00 storage8 [154990.899784]  [<ffffffffaf1d855a>] ? alloc_pages_current+0x9a/0x120
2017-01-28T04:08:40.706316+03:00 storage8 [154990.899910]  [<ffffffffaf1e03fb>] ? new_slab+0x39b/0x600
2017-01-28T04:08:40.706432+03:00 storage8 [154990.900037]  [<ffffffffaf42177e>] ? bio_attempt_back_merge+0x8e/0x110
2017-01-28T04:08:40.706557+03:00 storage8 [154990.900168]  [<ffffffffaf1e1a84>] ? ___slab_alloc+0x3e4/0x580
2017-01-28T04:08:40.706686+03:00 storage8 [154990.900297]  [<ffffffffaf29dffb>] ? ext4_init_io_end+0x1b/0x40
2017-01-28T04:08:40.706821+03:00 storage8 [154990.900425]  [<ffffffffaf420675>] ? generic_make_request+0x105/0x190
2017-01-28T04:08:40.706943+03:00 storage8 [154990.900556]  [<ffffffffaf29dffb>] ? ext4_init_io_end+0x1b/0x40
2017-01-28T04:08:40.707077+03:00 storage8 [154990.900684]  [<ffffffffaf20452f>] ? __slab_alloc+0xe/0x12
2017-01-28T04:08:40.707245+03:00 storage8 [154990.900816]  [<ffffffffaf1e2856>] ? kmem_cache_alloc+0x1a6/0x1b0
2017-01-28T04:08:40.707337+03:00 storage8 [154990.900945]  [<ffffffffaf29dffb>] ? ext4_init_io_end+0x1b/0x40
2017-01-28T04:08:40.707471+03:00 storage8 [154990.901073]  [<ffffffffaf29c4f8>] ? ext4_writepages+0x438/0xd80
2017-01-28T04:08:40.707603+03:00 storage8 [154990.901205]  [<ffffffffaf0d40a0>] ? __wake_up_common+0x50/0x90
2017-01-28T04:08:40.707743+03:00 storage8 [154990.901344]  [<ffffffffaf23462d>] ? __writeback_single_inode+0x3d/0x340
2017-01-28T04:08:40.707872+03:00 storage8 [154990.901481]  [<ffffffffaf235091>] ? writeback_sb_inodes+0x1e1/0x440
2017-01-28T04:08:40.708002+03:00 storage8 [154990.901615]  [<ffffffffaf23537d>] ? __writeback_inodes_wb+0x8d/0xc0
2017-01-28T04:08:40.708136+03:00 storage8 [154990.901742]  [<ffffffffaf2355e7>] ? wb_writeback+0x237/0x2c0
2017-01-28T04:08:40.708264+03:00 storage8 [154990.901873]  [<ffffffffaf235d96>] ? wb_workfn+0x1f6/0x370
2017-01-28T04:08:40.708390+03:00 storage8 [154990.902006]  [<ffffffffaf0ad2a4>] ? process_one_work+0x124/0x3b0
2017-01-28T04:08:40.708538+03:00 storage8 [154990.902134]  [<ffffffffaf0ad693>] ? worker_thread+0x123/0x470
2017-01-28T04:08:40.708652+03:00 storage8 [154990.902263]  [<ffffffffaf0ad570>] ? process_scheduled_works+0x40/0x40
2017-01-28T04:08:40.708785+03:00 storage8 [154990.902394]  [<ffffffffaf0ad570>] ? process_scheduled_works+0x40/0x40
2017-01-28T04:08:40.708929+03:00 storage8 [154990.902526]  [<ffffffffaf0b3672>] ? kthread+0xc2/0xe0
2017-01-28T04:08:40.709044+03:00 storage8 [154990.902655]  [<ffffffffaf0b35b0>] ? __kthread_init_worker+0xb0/0xb0
2017-01-28T04:08:40.709181+03:00 storage8 [154990.902784]  [<ffffffffaf833662>] ? ret_from_fork+0x22/0x30
----------

And why do you want to limit __GFP_KILLABLE to allocations "in the
loop" as in (2)? Any single allocation can use __GFP_KILLABLE. Not
allocating from memory reserves (even if it is a single page) will
reduce the possibility of falling into an OOM livelock on
CONFIG_MMU=n kernels and reduce the possibility of unwanted allocation
stalls/failures, for it will help preserve memory reserves for
allocation requests which are really important.

2017-01-28T04:08:40.708929+03:00 storage8 [154990.902526]  [<ffffffffaf0b3672>] ? kthread+0xc2/0xe0
2017-01-28T04:08:40.709044+03:00 storage8 [154990.902655]  [<ffffffffaf0b35b0>] ? __kthread_init_worker+0xb0/0xb0
2017-01-28T04:08:40.709181+03:00 storage8 [154990.902784]  [<ffffffffaf833662>] ? ret_from_fork+0x22/0x30
----------

And why do you want to limit __GFP_KILLABLE to allocations "done in a loop",
as in (2)? Any single allocation can use __GFP_KILLABLE. Not allocating from
the memory reserves (even if it is only a single page) would reduce the
possibility of falling into an OOM livelock on CONFIG_MMU=n kernels and reduce
the possibility of unwanted allocation stalls/failures, because it would help
preserve the memory reserves for the allocation requests which are really
important.
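
For illustration, a minimal sketch of the semantics being argued for,
assuming a hypothetical __GFP_KILLABLE bit (no such gfp flag exists
upstream) and a simplified hook point in the allocator slowpath;
__alloc_pages_no_watermarks() below is a placeholder name, not a real
function:

	/*
	 * Hypothetical sketch only. The intended semantics: a killable
	 * allocation fails outright once the caller has a fatal signal
	 * pending, instead of dipping into the memory reserves.
	 */
	static inline bool killable_alloc_should_fail(gfp_t gfp_mask)
	{
		return (gfp_mask & __GFP_KILLABLE) &&	/* assumed flag */
		       fatal_signal_pending(current);
	}

	/* in the slowpath, before granting ALLOC_NO_WATERMARKS */
	static struct page *alloc_from_reserves_or_fail(gfp_t gfp_mask,
							unsigned int order)
	{
		if (killable_alloc_should_fail(gfp_mask))
			return NULL;	/* back off, keep the reserves */
		/* placeholder for the no-watermarks fallback */
		return __alloc_pages_no_watermarks(gfp_mask, order);
	}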


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 0/3] fix few OOM victim allocation runaways
@ 2017-02-01  9:26 ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2017-02-01  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Tetsuo Handa, Al Viro, linux-mm, linux-fsdevel, LKML

Hi,
these three patches try to address simple OOM victim runaways where the
OOM victim can deplete the memory reserves completely. Tetsuo was able
to trigger the depletion in the write(2) path and I believe something
similar is possible for the read part. Vmalloc would be a bit harder
but still not impossible.

Unfortunately I do not see a better way around this issue as long as we
give OOM victims access to memory reserves without any limits. I have
tried to limit this access [1], which would at least help to keep some
memory for emergency actions. Anyway, even if we limit the amount of
reserves an OOM victim can consume, it is still preferable to back off
before the accessible reserves are depleted.

Tetsuo suggested introducing __GFP_KILLABLE, which would fail the
allocation rather than consume the reserves. I see two problems with
this approach.
        1) for this flag to work as expected, all the blocking
        operations in the allocator call chain (including direct
        reclaim) would have to be killable, and that is really
        non-trivial to achieve, especially when we have no control
        over shrinkers.
        2) even if the above could be dealt with, we would still have
        to find all the places which allocate in a loop driven by a
        user request. So it would not be simpler than an explicit
        fatal_signal_pending check (see the sketch after this list).
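
For comparison, the explicit check has roughly this shape (illustrative
only; the variable names are generic and the exact error path differs
per call site in the patches):

	/*
	 * Inside a loop that allocates or copies on behalf of a user
	 * request, bail out once the task has been killed so an OOM
	 * victim stops eating into the memory reserves.
	 */
	while (bytes_left) {
		if (fatal_signal_pending(current)) {
			status = written ? written : -EINTR;
			break;
		}
		/* ... allocate/copy the next chunk, update counters ... */
	}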

Thoughts?
Michal Hocko (3):
      fs: break out of iomap_file_buffered_write on fatal signals
      mm, fs: check for fatal signals in do_generic_file_read
      vmalloc: back off when the current task is killed

 fs/dax.c     | 5 +++++
 fs/iomap.c   | 3 +++
 mm/filemap.c | 5 +++++
 mm/vmalloc.c | 5 +++++
 4 files changed, 18 insertions(+)

[1] http://lkml.kernel.org/r/20161004090009.7974-2-mhocko@kernel.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-02-01 11:49 UTC | newest]

Thread overview: 23+ messages
-- links below jump to the message on this page --
2017-02-01  9:27 [PATCH 0/3] fix few OOM victim allocation runaways Michal Hocko
2017-02-01  9:27 ` Michal Hocko
2017-02-01  9:27 ` Michal Hocko
2017-02-01  9:27 ` [PATCH 1/3] fs: break out of iomap_file_buffered_write on fatal signals Michal Hocko
2017-02-01  9:27   ` Michal Hocko
2017-02-01  9:27   ` Michal Hocko
2017-02-01  9:28   ` Christoph Hellwig
2017-02-01  9:28     ` Christoph Hellwig
2017-02-01  9:27 ` [PATCH 2/3] mm, fs: check for fatal signals in do_generic_file_read Michal Hocko
2017-02-01  9:27   ` Michal Hocko
2017-02-01  9:27   ` Michal Hocko
2017-02-01  9:28   ` Christoph Hellwig
2017-02-01  9:28     ` Christoph Hellwig
2017-02-01  9:27 ` [PATCH 3/3] vmalloc: back off when the current task is killed Michal Hocko
2017-02-01  9:27   ` Michal Hocko
2017-02-01  9:27   ` Michal Hocko
2017-02-01  9:28   ` Christoph Hellwig
2017-02-01  9:28     ` Christoph Hellwig
2017-02-01 11:49 ` [PATCH 0/3] fix few OOM victim allocation runaways Tetsuo Handa
2017-02-01 11:49   ` Tetsuo Handa
  -- strict thread matches above, loose matches on Subject: below --
2017-02-01  9:26 Michal Hocko
2017-02-01  9:26 ` Michal Hocko
2017-02-01  9:26 ` Michal Hocko
