* [PATCH 0/2] blktrace: fix trace buffer leak and limit trace buffer size
From: Ming Lei @ 2021-03-23 8:14 UTC
To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, linux-kernel, Ming Lei

blktrace may pass big trace buffer size via '-b', meantime the system
may have lots of CPU cores, so too much memory can be allocated for
blktrace.

The 1st patch shutdown blktrace in blkdev_close() in case of task
exiting, for avoiding trace buffer leak.

The 2nd patch limits max trace buffer size for avoiding potential
OOM.

Ming Lei (2):
  block: shutdown blktrace in case of fatal signal pending
  blktrace: limit allowed total trace buffer size

 fs/block_dev.c          |  6 ++++++
 kernel/trace/blktrace.c | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

--
2.29.2

* [PATCH 1/2] block: shutdown blktrace in case of fatal signal pending
From: Ming Lei @ 2021-03-23 8:14 UTC
To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, linux-kernel, Ming Lei

blktrace may allocate lots of memory, if the process is terminated
by user or OOM, we need to provide one chance to remove the trace
buffer, otherwise memory leak may be caused.

Fix the issue by shutdown blktrace in case of task exiting in
blkdev_close().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/block_dev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 92ed7d5df677..1370eb731cea 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -34,6 +34,7 @@
 #include <linux/part_stat.h>
 #include <linux/uaccess.h>
 #include <linux/suspend.h>
+#include <linux/blktrace_api.h>
 #include "internal.h"
 
 struct bdev_inode {
@@ -1646,6 +1647,11 @@ EXPORT_SYMBOL(blkdev_put);
 static int blkdev_close(struct inode * inode, struct file * filp)
 {
 	struct block_device *bdev = I_BDEV(bdev_file_inode(filp));
+
+	/* shutdown blktrace in case of exiting which may be from OOM */
+	if (current->flags & PF_EXITING)
+		blk_trace_shutdown(bdev->bd_disk->queue);
+
 	blkdev_put(bdev, filp->f_mode);
 	return 0;
 }
--
2.29.2

* Re: [PATCH 1/2] block: shutdown blktrace in case of fatal signal pending
From: Christoph Hellwig @ 2021-03-30 16:53 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-kernel

On Tue, Mar 23, 2021 at 04:14:39PM +0800, Ming Lei wrote:
> blktrace may allocate lots of memory, if the process is terminated
> by user or OOM, we need to provide one chance to remove the trace
> buffer, otherwise memory leak may be caused.
>
> Fix the issue by shutdown blktrace in case of task exiting in
> blkdev_close().
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

This just seems weird. blktrace has no relationship to open
block device instances.

* Re: [PATCH 1/2] block: shutdown blktrace in case of fatal signal pending
From: Ming Lei @ 2021-03-31 0:16 UTC
To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, linux-kernel

On Tue, Mar 30, 2021 at 06:53:30PM +0200, Christoph Hellwig wrote:
> On Tue, Mar 23, 2021 at 04:14:39PM +0800, Ming Lei wrote:
> > blktrace may allocate lots of memory, if the process is terminated
> > by user or OOM, we need to provide one chance to remove the trace
> > buffer, otherwise memory leak may be caused.
> >
> > Fix the issue by shutdown blktrace in case of task exiting in
> > blkdev_close().
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> This just seems weird. blktrace has no relationship to open
> block device instances.

blktrace still needs to open one blkdev, then send its own ioctl
commands to block layer. In case of OOM, the allocated memory in
these ioctl commands won't be released.

Or any other suggestion?

--
Ming

* Re: [PATCH 1/2] block: shutdown blktrace in case of fatal signal pending
From: Christoph Hellwig @ 2021-04-02 17:27 UTC
To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel

On Wed, Mar 31, 2021 at 08:16:50AM +0800, Ming Lei wrote:
> On Tue, Mar 30, 2021 at 06:53:30PM +0200, Christoph Hellwig wrote:
> > On Tue, Mar 23, 2021 at 04:14:39PM +0800, Ming Lei wrote:
> > > blktrace may allocate lots of memory, if the process is terminated
> > > by user or OOM, we need to provide one chance to remove the trace
> > > buffer, otherwise memory leak may be caused.
> > >
> > > Fix the issue by shutdown blktrace in case of task exiting in
> > > blkdev_close().
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >
> > This just seems weird. blktrace has no relationship to open
> > block device instances.
>
> blktrace still needs to open one blkdev, then send its own ioctl
> commands to block layer. In case of OOM, the allocated memory in
> these ioctl commands won't be released.
>
> Or any other suggestion?

Not much we can do there I think. If we want to autorelease memory
it needs to be an API that ties the memory allocation to an FD.

* Re: [PATCH 1/2] block: shutdown blktrace in case of fatal signal pending
From: Ming Lei @ 2021-04-03 8:10 UTC
To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, linux-kernel

On Fri, Apr 02, 2021 at 07:27:30PM +0200, Christoph Hellwig wrote:
> On Wed, Mar 31, 2021 at 08:16:50AM +0800, Ming Lei wrote:
> > On Tue, Mar 30, 2021 at 06:53:30PM +0200, Christoph Hellwig wrote:
> > > On Tue, Mar 23, 2021 at 04:14:39PM +0800, Ming Lei wrote:
> > > > blktrace may allocate lots of memory, if the process is terminated
> > > > by user or OOM, we need to provide one chance to remove the trace
> > > > buffer, otherwise memory leak may be caused.
> > > >
> > > > Fix the issue by shutdown blktrace in case of task exiting in
> > > > blkdev_close().
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > >
> > > This just seems weird. blktrace has no relationship to open
> > > block device instances.
> >
> > blktrace still needs to open one blkdev, then send its own ioctl
> > commands to block layer. In case of OOM, the allocated memory in
> > these ioctl commands won't be released.
> >
> > Or any other suggestion?
>
> Not much we can do there I think. If we want to autorelease memory
> it needs to be an API that ties the memory allocation to an FD.

We still may shutdown blktrace if current is the last opener, otherwise
new blktrace can't be started and memory should be leaked forever, and
what do you think of the revised version?

From de33ec85ee1ce2865aa04f2639e480ea4db4eebf Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 23 Mar 2021 10:32:23 +0800
Subject: [PATCH] block: shutdown blktrace in case of task exiting

blktrace may allocate lots of memory, if the process is terminated
by user or OOM, we need to provide one chance to remove the trace
buffer, otherwise memory leak may be caused. Also new blktrace
instance can't be started too.

Fix the issue by shutdown blktrace in case of task exiting in
blkdev_close() when it is the last opener.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/block_dev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 92ed7d5df677..8fa59cecce72 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -34,6 +34,7 @@
 #include <linux/part_stat.h>
 #include <linux/uaccess.h>
 #include <linux/suspend.h>
+#include <linux/blktrace_api.h>
 #include "internal.h"
 
 struct bdev_inode {
@@ -1646,6 +1647,11 @@ EXPORT_SYMBOL(blkdev_put);
 static int blkdev_close(struct inode * inode, struct file * filp)
 {
 	struct block_device *bdev = I_BDEV(bdev_file_inode(filp));
+
+	/* shutdown blktrace in case of exiting which may be from OOM */
+	if ((current->flags & PF_EXITING) && (bdev->bd_openers == 1))
+		blk_trace_shutdown(bdev->bd_disk->queue);
+
 	blkdev_put(bdev, filp->f_mode);
 	return 0;
 }
--
2.29.2

--
Ming

* Re: [PATCH 1/2] block: shutdown blktrace in case of fatal signal pending
From: Ming Lei @ 2021-04-03 9:04 UTC
To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, linux-kernel

On Sat, Apr 03, 2021 at 04:10:16PM +0800, Ming Lei wrote:
> On Fri, Apr 02, 2021 at 07:27:30PM +0200, Christoph Hellwig wrote:
> > On Wed, Mar 31, 2021 at 08:16:50AM +0800, Ming Lei wrote:
> > > On Tue, Mar 30, 2021 at 06:53:30PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Mar 23, 2021 at 04:14:39PM +0800, Ming Lei wrote:
> > > > > blktrace may allocate lots of memory, if the process is terminated
> > > > > by user or OOM, we need to provide one chance to remove the trace
> > > > > buffer, otherwise memory leak may be caused.
> > > > >
> > > > > Fix the issue by shutdown blktrace in case of task exiting in
> > > > > blkdev_close().
> > > > >
> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > >
> > > > This just seems weird. blktrace has no relationship to open
> > > > block device instances.
> > >
> > > blktrace still needs to open one blkdev, then send its own ioctl
> > > commands to block layer. In case of OOM, the allocated memory in
> > > these ioctl commands won't be released.
> > >
> > > Or any other suggestion?
> >
> > Not much we can do there I think. If we want to autorelease memory
> > it needs to be an API that ties the memory allocation to an FD.
>
> We still may shutdown blktrace if current is the last opener, otherwise
> new blktrace can't be started and memory should be leaked forever, and
> what do you think of the revised version?

This way seems not good enough, another better one is to use
file->private_data for such purpose since blkdev fs doesn't use
file->private_data, then we can shutdown blktrace just for the
blktrace FD:

From 191dff30abfd48c38a78dec78e011a39a3b606ca Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 23 Mar 2021 10:32:23 +0800
Subject: [PATCH] block: shutdown blktrace in case of task exiting

blktrace may allocate lots of memory, if the process is terminated
by user or OOM, we need to provide one chance to remove the trace
buffer, otherwise memory leak may be caused. Also new blktrace
instance can't be started too.

Fix the issue by shutdown blktrace in bdev_close() if blktrace was
setup on this FD.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/ioctl.c                |  2 ++
 fs/block_dev.c               | 12 ++++++++++++
 include/linux/blktrace_api.h | 11 +++++++++++
 3 files changed, 25 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index ff241e663c01..7dad4a546db3 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -611,6 +611,8 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	else
 		mode &= ~FMODE_NDELAY;
 
+	blkdev_mark_blktrace(file, cmd);
+
 	switch (cmd) {
 	/* These need separate implementations for the data structure */
 	case HDIO_GETGEO:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 92ed7d5df677..aaa7d7d1e5a4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -34,6 +34,7 @@
 #include <linux/part_stat.h>
 #include <linux/uaccess.h>
 #include <linux/suspend.h>
+#include <linux/blktrace_api.h>
 #include "internal.h"
 
 struct bdev_inode {
@@ -1646,6 +1647,15 @@ EXPORT_SYMBOL(blkdev_put);
 static int blkdev_close(struct inode * inode, struct file * filp)
 {
 	struct block_device *bdev = I_BDEV(bdev_file_inode(filp));
+
+	/*
+	 * The task running blktrace is supposed to shutdown blktrace
+	 * by ioctl. If they forget to shutdown or can't do it because
+	 * of OOM or sort of situation, we shutdown for them.
+	 */
+	if (blkdev_has_run_blktrace(filp))
+		blk_trace_shutdown(bdev->bd_disk->queue);
+
 	blkdev_put(bdev, filp->f_mode);
 	return 0;
 }
@@ -1664,6 +1674,8 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	else
 		mode &= ~FMODE_NDELAY;
 
+	blkdev_mark_blktrace(file, cmd);
+
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index a083e15df608..754058c1965c 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -135,4 +135,15 @@ static inline unsigned int blk_rq_trace_nr_sectors(struct request *rq)
 	return blk_rq_is_passthrough(rq) ? 0 : blk_rq_sectors(rq);
 }
 
+static inline void blkdev_mark_blktrace(struct file *file, unsigned int cmd)
+{
+	if (cmd == BLKTRACESETUP)
+		file->private_data = (void *)-1;
+}
+
+static inline bool blkdev_has_run_blktrace(struct file *file)
+{
+	return file->private_data == (void *)-1;
+}
+
 #endif
--
2.29.2

--
Ming

* Re: [PATCH 1/2] block: shutdown blktrace in case of fatal signal pending
From: Christoph Hellwig @ 2021-04-06 6:30 UTC
To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel

On Sat, Apr 03, 2021 at 04:10:16PM +0800, Ming Lei wrote:
> We still may shutdown blktrace if current is the last opener, otherwise
> new blktrace can't be started and memory should be leaked forever, and
> what do you think of the revised version?

I don't think this works. For one there might be users of the
blktrace ioctl that explicitly rely on this not happening as
difference processes might start the tracing vs actually consume the
trace data. Second this might not actually work as another process
could be the last opener.

If you want to fix this for the blktrace tool (common) case I think we
need a new ioctl that explicitly ties the buffer lifetime to the fd.

* [PATCH 2/2] blktrace: limit allowed total trace buffer size
From: Ming Lei @ 2021-03-23 8:14 UTC
To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, linux-kernel, Ming Lei

On some ARCHs, such as aarch64, page size may be 64K, meantime there may
be lots of CPU cores. relay_open() needs to allocate pages on each CPU
blktrace, so easily too many pages are taken by blktrace. For example,
on one ARM64 server: 224 CPU cores, 16G RAM, blktrace finally got
allocated 7GB in case of 'blktrace -b 8192' which is used by device-mapper
test suite[1]. This way could cause OOM easily.

Fix the issue by limiting max allowed pages to be 1/8 of totalram_pages().

[1] https://github.com/jthornber/device-mapper-test-suite.git

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/trace/blktrace.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index c221e4c3f625..8403ff19d533 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -466,6 +466,35 @@ static void blk_trace_setup_lba(struct blk_trace *bt,
 	}
 }
 
+/* limit total allocated buffer size is <= 1/8 of total pages */
+static void validate_and_adjust_buf(struct blk_user_trace_setup *buts)
+{
+	unsigned buf_size = buts->buf_size;
+	unsigned buf_nr = buts->buf_nr;
+	unsigned long max_allowed_pages = totalram_pages() >> 3;
+	unsigned long req_pages = PAGE_ALIGN(buf_size * buf_nr) >> PAGE_SHIFT;
+
+	if (req_pages * num_online_cpus() <= max_allowed_pages)
+		return;
+
+	req_pages = DIV_ROUND_UP(max_allowed_pages, num_online_cpus());
+
+	if (req_pages == 0) {
+		buf_size = PAGE_SIZE;
+		buf_nr = 1;
+	} else {
+		buf_size = req_pages << PAGE_SHIFT / buf_nr;
+		if (buf_size < PAGE_SIZE)
+			buf_size = PAGE_SIZE;
+		buf_nr = req_pages << PAGE_SHIFT / buf_size;
+		if (buf_nr == 0)
+			buf_nr = 1;
+	}
+
+	buts->buf_size = min_t(unsigned, buf_size, buts->buf_size);
+	buts->buf_nr = min_t(unsigned, buf_nr, buts->buf_nr);
+}
+
 /*
  * Setup everything required to start tracing
  */
@@ -482,6 +511,9 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!buts->buf_size || !buts->buf_nr)
 		return -EINVAL;
 
+	/* make sure not allocate too much for userspace */
+	validate_and_adjust_buf(buts);
+
 	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
 	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
 
--
2.29.2

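For readers checking the 7GB figure in the changelog above, here is a rough sketch of the arithmetic. It assumes the blktrace tool's default of 4 sub-buffers per CPU (the message only gives '-b 8192', i.e. 8192 KiB per sub-buffer), and it treats total RAM as exactly 16 GiB, which is only an approximation of what totalram_pages() would report. The helper names are made up for illustration and do not exist in the kernel.

```c
#include <stdint.h>

/*
 * Bytes the relay buffers would need in total: every CPU gets
 * buf_nr sub-buffers of buf_size bytes each.
 * (Hypothetical helper, for illustration only.)
 */
static uint64_t trace_total_bytes(uint64_t buf_size, uint64_t buf_nr,
				  uint64_t nr_cpus)
{
	return buf_size * buf_nr * nr_cpus;
}

/*
 * The cap proposed in the patch is 1/8 of total RAM; expressed here
 * in bytes rather than pages. (Hypothetical helper.)
 */
static uint64_t trace_cap_bytes(uint64_t total_ram_bytes)
{
	return total_ram_bytes >> 3;
}
```

Under those assumptions, trace_total_bytes(8192 * 1024, 4, 224) comes to 7 GiB, while trace_cap_bytes() for 16 GiB of RAM is 2 GiB, so the request in the example is roughly 3.5 times the proposed limit.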
* Re: [PATCH 2/2] blktrace: limit allowed total trace buffer size
From: Su Yue @ 2021-03-30 2:57 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-kernel

On Tue 23 Mar 2021 at 16:14, Ming Lei <ming.lei@redhat.com> wrote:

> On some ARCHs, such as aarch64, page size may be 64K, meantime there may
> be lots of CPU cores. relay_open() needs to allocate pages on each CPU
> blktrace, so easily too many pages are taken by blktrace. For example,
> on one ARM64 server: 224 CPU cores, 16G RAM, blktrace finally got
> allocated 7GB in case of 'blktrace -b 8192' which is used by device-mapper
> test suite[1]. This way could cause OOM easily.
>
> Fix the issue by limiting max allowed pages to be 1/8 of totalram_pages().
>
> [1] https://github.com/jthornber/device-mapper-test-suite.git
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  kernel/trace/blktrace.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index c221e4c3f625..8403ff19d533 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -466,6 +466,35 @@ static void blk_trace_setup_lba(struct blk_trace *bt,
>  	}
>  }
>
> +/* limit total allocated buffer size is <= 1/8 of total pages */
> +static void validate_and_adjust_buf(struct blk_user_trace_setup *buts)
> +{
> +	unsigned buf_size = buts->buf_size;
> +	unsigned buf_nr = buts->buf_nr;
> +	unsigned long max_allowed_pages = totalram_pages() >> 3;
> +	unsigned long req_pages = PAGE_ALIGN(buf_size * buf_nr) >> PAGE_SHIFT;
> +
> +	if (req_pages * num_online_cpus() <= max_allowed_pages)
> +		return;
> +
> +	req_pages = DIV_ROUND_UP(max_allowed_pages, num_online_cpus());
> +
> +	if (req_pages == 0) {
> +		buf_size = PAGE_SIZE;
> +		buf_nr = 1;
> +	} else {
> +		buf_size = req_pages << PAGE_SHIFT / buf_nr;
>
Should it be:
buf_size = (req_pages << PAGE_SHIFT) / buf_nr;
?
The priority of '<<' is lower than '/', right? :)

--
Su
> +		if (buf_size < PAGE_SIZE)
> +			buf_size = PAGE_SIZE;
> +		buf_nr = req_pages << PAGE_SHIFT / buf_size;
> +		if (buf_nr == 0)
> +			buf_nr = 1;
> +	}
> +
> +	buts->buf_size = min_t(unsigned, buf_size, buts->buf_size);
> +	buts->buf_nr = min_t(unsigned, buf_nr, buts->buf_nr);
> +}
> +
>  /*
>   * Setup everything required to start tracing
>   */
> @@ -482,6 +511,9 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  	if (!buts->buf_size || !buts->buf_nr)
>  		return -EINVAL;
>
> +	/* make sure not allocate too much for userspace */
> +	validate_and_adjust_buf(buts);
> +
>  	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
>  	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';

* Re: [PATCH 2/2] blktrace: limit allowed total trace buffer size
From: Ming Lei @ 2021-03-30 3:55 UTC
To: Su Yue; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-kernel

On Tue, Mar 30, 2021 at 10:57:04AM +0800, Su Yue wrote:
>
> On Tue 23 Mar 2021 at 16:14, Ming Lei <ming.lei@redhat.com> wrote:
>
> > On some ARCHs, such as aarch64, page size may be 64K, meantime there may
> > be lots of CPU cores. relay_open() needs to allocate pages on each CPU
> > blktrace, so easily too many pages are taken by blktrace. For example,
> > on one ARM64 server: 224 CPU cores, 16G RAM, blktrace finally got
> > allocated 7GB in case of 'blktrace -b 8192' which is used by
> > device-mapper
> > test suite[1]. This way could cause OOM easily.
> >
> > Fix the issue by limiting max allowed pages to be 1/8 of
> > totalram_pages().
> >
> > [1] https://github.com/jthornber/device-mapper-test-suite.git
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  kernel/trace/blktrace.c | 32 ++++++++++++++++++++++++++++++++
> >  1 file changed, 32 insertions(+)
> >
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index c221e4c3f625..8403ff19d533 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -466,6 +466,35 @@ static void blk_trace_setup_lba(struct blk_trace
> > *bt,
> >  	}
> >  }
> >
> > +/* limit total allocated buffer size is <= 1/8 of total pages */
> > +static void validate_and_adjust_buf(struct blk_user_trace_setup *buts)
> > +{
> > +	unsigned buf_size = buts->buf_size;
> > +	unsigned buf_nr = buts->buf_nr;
> > +	unsigned long max_allowed_pages = totalram_pages() >> 3;
> > +	unsigned long req_pages = PAGE_ALIGN(buf_size * buf_nr) >> PAGE_SHIFT;
> > +
> > +	if (req_pages * num_online_cpus() <= max_allowed_pages)
> > +		return;
> > +
> > +	req_pages = DIV_ROUND_UP(max_allowed_pages, num_online_cpus());
> > +
> > +	if (req_pages == 0) {
> > +		buf_size = PAGE_SIZE;
> > +		buf_nr = 1;
> > +	} else {
> > +		buf_size = req_pages << PAGE_SHIFT / buf_nr;
> >
> Should it be:
> buf_size = (req_pages << PAGE_SHIFT) / buf_nr;
> ?
> The priority of '<<' is lower than '/', right? :)

Good catch, thanks!

--
Ming

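The precedence point raised above can be checked in isolation. In C, '/' binds tighter than '<<', so the posted expression parses as req_pages << (PAGE_SHIFT / buf_nr). The sketch below is a userspace illustration only: EXAMPLE_PAGE_SHIFT and both function names are made up here, and a shift of 16 is assumed to match the 64K page size mentioned in the changelog.

```c
/* Assumed for illustration: 64K pages, i.e. a page shift of 16. */
#define EXAMPLE_PAGE_SHIFT 16

/*
 * Same shape as the expression in the posted patch. Without
 * parentheses this is req_pages << (EXAMPLE_PAGE_SHIFT / buf_nr).
 */
static unsigned long buf_size_as_posted(unsigned long req_pages,
					unsigned int buf_nr)
{
	return req_pages << EXAMPLE_PAGE_SHIFT / buf_nr;
}

/* The parenthesized form suggested in the review. */
static unsigned long buf_size_as_suggested(unsigned long req_pages,
					   unsigned int buf_nr)
{
	return (req_pages << EXAMPLE_PAGE_SHIFT) / buf_nr;
}
```

With req_pages = 147 and buf_nr = 4, the posted form yields 147 << 4 = 2352 bytes, while the suggested form yields 9633792 / 4 = 2408448 bytes. In the patch the too-small value would then be rounded up to PAGE_SIZE by the following check, so the bug would show up as a much smaller buffer than intended rather than as a crash.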
* Re: [PATCH 2/2] blktrace: limit allowed total trace buffer size
From: Christoph Hellwig @ 2021-03-30 16:57 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-kernel

On Tue, Mar 23, 2021 at 04:14:40PM +0800, Ming Lei wrote:
> On some ARCHs, such as aarch64, page size may be 64K, meantime there may

Which we call arm64..

> be lots of CPU cores. relay_open() needs to allocate pages on each CPU
> blktrace, so easily too many pages are taken by blktrace. For example,
> on one ARM64 server: 224 CPU cores, 16G RAM, blktrace finally got
> allocated 7GB in case of 'blktrace -b 8192' which is used by device-mapper
> test suite[1]. This way could cause OOM easily.
>
> Fix the issue by limiting max allowed pages to be 1/8 of totalram_pages().

Doesn't this break the blktrace ABI by using different buffer size
and numbers than the user asked for? I think we can enforce an
upper limit and error out, but silently adjusting seems wrong.

Wouldn't it make more sense to fix userspace to not request so many
and so big buffers instead?

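One way to read the "enforce an upper limit and error out" suggestion is sketched below as plain userspace C. This is not code from the thread: the function name, its parameters, and the choice of -EINVAL as the error are all assumptions made for illustration, and the 1/8 threshold is simply carried over from the posted patch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical check: reject a request instead of shrinking it.
 * Returns 0 if buf_size * buf_nr on every CPU stays within 1/8 of
 * total_pages, and -EINVAL otherwise (the errno is an assumption).
 */
static int check_trace_buf_limit(uint32_t buf_size, uint32_t buf_nr,
				 unsigned int nr_cpus, uint64_t total_pages,
				 unsigned int page_shift)
{
	uint64_t page_size = 1ULL << page_shift;
	uint64_t per_cpu_bytes = (uint64_t)buf_size * buf_nr;
	/* Round the per-CPU request up to whole pages. */
	uint64_t per_cpu_pages = (per_cpu_bytes + page_size - 1) >> page_shift;
	uint64_t max_allowed_pages = total_pages >> 3;

	if (!buf_size || !buf_nr || !nr_cpus)
		return -EINVAL;

	/* Compare by division so the product cannot overflow. */
	if (per_cpu_pages > max_allowed_pages / nr_cpus)
		return -EINVAL;

	return 0;
}
```

Taking the numbers from the changelog (224 CPUs, 64K pages, and 16 GiB approximated as 262144 pages), a request of 8 MiB x 4 per CPU is rejected, while a request of 512 KiB x 4 per CPU passes. The caller sees the failure and can retry with smaller values, which keeps the kernel from handing back buffers of a different size than userspace asked for.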
* Re: [PATCH 0/2] blktrace: fix trace buffer leak and limit trace buffer size
From: Ming Lei @ 2021-03-30 2:04 UTC
To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, linux-kernel

On Tue, Mar 23, 2021 at 04:14:38PM +0800, Ming Lei wrote:
> blktrace may pass big trace buffer size via '-b', meantime the system
> may have lots of CPU cores, so too much memory can be allocated for
> blktrace.
>
> The 1st patch shutdown blktrace in blkdev_close() in case of task
> exiting, for avoiding trace buffer leak.
>
> The 2nd patch limits max trace buffer size for avoiding potential
> OOM.
>
>
> Ming Lei (2):
>   block: shutdown blktrace in case of fatal signal pending
>   blktrace: limit allowed total trace buffer size
>
>  fs/block_dev.c          |  6 ++++++
>  kernel/trace/blktrace.c | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)

Hello Guys,

Ping...

BTW, this is another OOM risk in blktrace userspace which is caused by
mlock(16 * buffer_size) * nr_cpus, so I think we need to avoid memory
leak caused by OOM.

Thanks,
Ming
