All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
@ 2018-08-02  5:11 Minchan Kim
  2018-08-02 21:13 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2018-08-02  5:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Minchan Kim, Sergey Senozhatsky, Tino Lehnig, stable

If zram supports writeback feature, it's no more syncrhonous
device beause zram does synchronous IO opeation for
incompressible page.

Do not pretend to be syncrhonous IO device. It makes system
very sluggish as waiting IO completion from upper layer.

Furthermore, it makes user-after-free problem because swap
think the opearion is done when the IO functions returns so
it could free page by will(e.g., lock_page_or_retry and
goto out_release in do_swap_page) but in fact, IO is
asynchrnous so driver could access just freed page afterward.

This patch fixes the problem.

BUG: Bad page state in process qemu-system-x86  pfn:3dfab21
page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
flags: 0x17fffc000000008(uptodate)
raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
bad because of flags: 0x8(uptodate)
Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel bin
fmt_misc pcbc aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper iTCO_vendor_support intel_cstate lpc_ich mei_me intel_uncore intel_rapl_perf pcspkr joydev sg mfd_core ioatdma mei wmi evdev ipmi_si ipmi_devintf ipmi_msghandler
acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod xhci_pci ehci_pci ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit crc32c_intel scsi_mod i2c_i8
01 dca usbcore
CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G    B 4.18.0-rc5+ #1
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
Call Trace:
 dump_stack+0x5c/0x7b
 bad_page+0xba/0x120
 get_page_from_freelist+0x1016/0x1250
 __alloc_pages_nodemask+0xfa/0x250
 alloc_pages_vma+0x7c/0x1c0
 do_swap_page+0x347/0x920
 ? __update_load_avg_se.isra.38+0x1eb/0x1f0
 ? cpumask_next_wrap+0x3d/0x60
 __handle_mm_fault+0x7b4/0x1110
 ? update_load_avg+0x5ea/0x720
 handle_mm_fault+0xfc/0x1f0
 __get_user_pages+0x12f/0x690
 get_user_pages_unlocked+0x148/0x1f0
 __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
 try_async_pf+0x87/0x230 [kvm]
 tdp_page_fault+0x132/0x290 [kvm]
 ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
 kvm_mmu_page_fault+0x74/0x570 [kvm]
 ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
 ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
 ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
 kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
 ? __update_load_avg_se.isra.38+0x1eb/0x1f0
 ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
 kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
 ? __switch_to+0x395/0x450
 ? __switch_to+0x395/0x450
 do_vfs_ioctl+0xa2/0x630
 ? __schedule+0x3fd/0x890
 ksys_ioctl+0x70/0x80
 ? exit_to_usermode_loop+0xca/0xf0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb30361add7
Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007fb2e97f98b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb30361add7
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
RBP: 00005652b984e0f0 R08: 00005652b7d513d0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb308c66000 R14: 0000000000000000 R15: 00005652b984e0f0

Link: https://lore.kernel.org/lkml/0516ae2d-b0fd-92c5-aa92-112ba7bd32fc@contabo.de/
Reported-by: Tino Lehnig <tino.lehnig@contabo.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Tino Lehnig <tino.lehnig@contabo.de>
Cc: <stable@vger.kernel.org> # v4.15+
Tested-by: Tino Lehnig <tino.lehnig@contabo.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7436b2d27fa3..0b6eda1bd77a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -298,7 +298,8 @@ static void reset_bdev(struct zram *zram)
 	zram->backing_dev = NULL;
 	zram->old_block_size = 0;
 	zram->bdev = NULL;
-
+	zram->disk->queue->backing_dev_info->capabilities |=
+				BDI_CAP_SYNCHRONOUS_IO;
 	kvfree(zram->bitmap);
 	zram->bitmap = NULL;
 }
@@ -400,6 +401,8 @@ static ssize_t backing_dev_store(struct device *dev,
 	zram->backing_dev = backing_dev;
 	zram->bitmap = bitmap;
 	zram->nr_pages = nr_pages;
+	zram->disk->queue->backing_dev_info->capabilities &=
+			~BDI_CAP_SYNCHRONOUS_IO;
 	up_write(&zram->init_lock);
 
 	pr_info("setup backing device %s\n", file_name);
-- 
2.18.0.597.ga71716f1ad-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-02  5:11 [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature Minchan Kim
@ 2018-08-02 21:13 ` Andrew Morton
  2018-08-03  2:39   ` Sergey Senozhatsky
  2018-08-03  2:51   ` Minchan Kim
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2018-08-02 21:13 UTC (permalink / raw)
  To: Minchan Kim; +Cc: LKML, Sergey Senozhatsky, Tino Lehnig, stable, Jens Axboe

On Thu,  2 Aug 2018 14:11:12 +0900 Minchan Kim <minchan@kernel.org> wrote:

> If zram supports writeback feature, it's no more syncrhonous
> device beause zram does synchronous IO opeation for
> incompressible page.
> 
> Do not pretend to be syncrhonous IO device. It makes system
> very sluggish as waiting IO completion from upper layer.
> 
> Furthermore, it makes user-after-free problem because swap
> think the opearion is done when the IO functions returns so
> it could free page by will(e.g., lock_page_or_retry and
> goto out_release in do_swap_page) but in fact, IO is
> asynchrnous so driver could access just freed page afterward.
> 
> This patch fixes the problem.

That changelog is rather hard to follow.  Please review my edits:

: If zram supports writeback feature, it's no longer a BD_CAP_SYNCHRONOUS_IO
: device beause zram does synchronous IO operations for incompressible pages.
: 
: Do not pretend to be synchronous IO device.  It makes the system very
: sluggish due to waiting for IO completion from upper layers.
: 
: Furthermore, it causes a user-after-free problem because swap thinks the
: opearion is done when the IO functions returns so it can free the page
: (e.g., lock_page_or_retry and goto out_release in do_swap_page) but in
: fact, IO is asynchrnous so the driver could access a just freed page
: afterward.
: 
: This patch fixes the problem. 

> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 7436b2d27fa3..0b6eda1bd77a 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -298,7 +298,8 @@ static void reset_bdev(struct zram *zram)
>  	zram->backing_dev = NULL;
>  	zram->old_block_size = 0;
>  	zram->bdev = NULL;
> -
> +	zram->disk->queue->backing_dev_info->capabilities |=
> +				BDI_CAP_SYNCHRONOUS_IO;
>  	kvfree(zram->bitmap);
>  	zram->bitmap = NULL;
>  }
> @@ -400,6 +401,8 @@ static ssize_t backing_dev_store(struct device *dev,
>  	zram->backing_dev = backing_dev;
>  	zram->bitmap = bitmap;
>  	zram->nr_pages = nr_pages;
> +	zram->disk->queue->backing_dev_info->capabilities &=
> +			~BDI_CAP_SYNCHRONOUS_IO;
>  	up_write(&zram->init_lock);
>  
>  	pr_info("setup backing device %s\n", file_name);

A reader looking at this would wonder "why the heck are we doing that".
Adding a code comment would help them.

Is it legitimate to be altering the bdi capabilities at this level?  Or
is this hacky?

If "yes" then should reset_bdev() be unconditionally setting
BDI_CAP_SYNCHRONOUS_IO?  Shouldn't it be restoring that flag to its
previous setting?


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-02 21:13 ` Andrew Morton
@ 2018-08-03  2:39   ` Sergey Senozhatsky
  2018-08-03  2:52     ` Sergey Senozhatsky
  2018-08-03  3:00     ` Minchan Kim
  2018-08-03  2:51   ` Minchan Kim
  1 sibling, 2 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2018-08-03  2:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, LKML, Sergey Senozhatsky, Tino Lehnig, stable, Jens Axboe

On (08/02/18 14:13), Andrew Morton wrote:
[..]
> That changelog is rather hard to follow.  Please review my edits:
> 
> : If zram supports writeback feature, it's no longer a BD_CAP_SYNCHRONOUS_IO
							^BDI_CAP_SYNCHRONOUS_IO

[..]

> A reader looking at this would wonder "why the heck are we doing that".
> Adding a code comment would help them.

The interesting thing here is that include/linux/backing-dev.h
BDI_CAP_SYNCHRONOUS_IO comment says

	"Device is so fast that asynchronous IO would be inefficient."

Which is not the reason why BDI_CAP_SYNCHRONOUS_IO is used by ZRAM.
Probably, the comment needs to be updated as well.

Both SWP_SYNCHRONOUS_IO and BDI_CAP_SYNCHRONOUS_IO tend to pivot
"efficiency" [looking at the comments], but in ZRAM's case the whole
reason to use SYNC IO is a race condition and user-after-free that
follows.

	-ss

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-02 21:13 ` Andrew Morton
  2018-08-03  2:39   ` Sergey Senozhatsky
@ 2018-08-03  2:51   ` Minchan Kim
  2018-08-03 23:28     ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2018-08-03  2:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Tino Lehnig, stable, Jens Axboe

Hi Andrew,

On Thu, Aug 02, 2018 at 02:13:04PM -0700, Andrew Morton wrote:
> On Thu,  2 Aug 2018 14:11:12 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > If zram supports writeback feature, it's no more syncrhonous
> > device beause zram does synchronous IO opeation for
> > incompressible page.
> > 
> > Do not pretend to be syncrhonous IO device. It makes system
> > very sluggish as waiting IO completion from upper layer.
> > 
> > Furthermore, it makes user-after-free problem because swap
> > think the opearion is done when the IO functions returns so
> > it could free page by will(e.g., lock_page_or_retry and
> > goto out_release in do_swap_page) but in fact, IO is
> > asynchrnous so driver could access just freed page afterward.
> > 
> > This patch fixes the problem.

I fixed my faults from original description.
Otherwise, ones you corrected looks good to me.

> 
> That changelog is rather hard to follow.  Please review my edits:
> 
> : If zram supports writeback feature, it's no longer a BD_CAP_SYNCHRONOUS_IO
> : device beause zram does synchronous IO operations for incompressible pages.

                            asynchronous

> : 
> : Do not pretend to be synchronous IO device.  It makes the system very
> : sluggish due to waiting for IO completion from upper layers.
> : 
> : Furthermore, it causes a user-after-free problem because swap thinks the
> : opearion is done when the IO functions returns so it can free the page
> : (e.g., lock_page_or_retry and goto out_release in do_swap_page) but in
> : fact, IO is asynchrnous so the driver could access a just freed page

                asynchronous 

> : afterward.
> : 
> : This patch fixes the problem. 
> 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 7436b2d27fa3..0b6eda1bd77a 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -298,7 +298,8 @@ static void reset_bdev(struct zram *zram)
> >  	zram->backing_dev = NULL;
> >  	zram->old_block_size = 0;
> >  	zram->bdev = NULL;
> > -
> > +	zram->disk->queue->backing_dev_info->capabilities |=
> > +				BDI_CAP_SYNCHRONOUS_IO;
> >  	kvfree(zram->bitmap);
> >  	zram->bitmap = NULL;
> >  }
> > @@ -400,6 +401,8 @@ static ssize_t backing_dev_store(struct device *dev,
> >  	zram->backing_dev = backing_dev;
> >  	zram->bitmap = bitmap;
> >  	zram->nr_pages = nr_pages;
> > +	zram->disk->queue->backing_dev_info->capabilities &=
> > +			~BDI_CAP_SYNCHRONOUS_IO;
> >  	up_write(&zram->init_lock);
> >  
> >  	pr_info("setup backing device %s\n", file_name);
> 
> A reader looking at this would wonder "why the heck are we doing that".
> Adding a code comment would help them.

I will add

/*
 * With writeback feature, zram does a asynchronous IO so it's no longer
 * synchronous device. If it pretend to be, upper layer could wait IO
 * completion rather than (submit and return), which will cause system
 * sluggish.
 * Furthermore, when the IO function returns(e.g., swap_readpage),
 * uppler lay could guess IO was done so it could deallocate the page
 * freely but in fact, IO is going on and it finally could cause
 * use-after-free when the IO is really done.
 */

> 
> Is it legitimate to be altering the bdi capabilities at this level?  Or
> is this hacky?

Most of device's bdi capability seems to be static but there are few drivers
which can change capability. For example, BDI_CAP_STABLE_WRITES

drivers/scsi/iscsi_tcp.c
drivers/md/raid5.c

I believe it's driver itself advertisement stuff so I hope it's not hack.

> 
> If "yes" then should reset_bdev() be unconditionally setting
> BDI_CAP_SYNCHRONOUS_IO?  Shouldn't it be restoring that flag to its
> previous setting?
> 

Yu, reset_bdev should set it unconditionally. Because zram's default
mode is synchronous and it changed only if user set backing device.

I will respin the patch with revised comment and description.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03  2:39   ` Sergey Senozhatsky
@ 2018-08-03  2:52     ` Sergey Senozhatsky
  2018-08-03  3:00     ` Minchan Kim
  1 sibling, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2018-08-03  2:52 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Minchan Kim, LKML, Tino Lehnig, stable, Jens Axboe

On (08/03/18 11:39), Sergey Senozhatsky wrote:
> [..]
>
> > A reader looking at this would wonder "why the heck are we doing that".
> > Adding a code comment would help them.
> 
> The interesting thing here is that include/linux/backing-dev.h
> BDI_CAP_SYNCHRONOUS_IO comment says
> 
> 	"Device is so fast that asynchronous IO would be inefficient."
> 
> Which is not the reason why BDI_CAP_SYNCHRONOUS_IO is used by ZRAM.
> Probably, the comment needs to be updated as well.
> 
> Both SWP_SYNCHRONOUS_IO and BDI_CAP_SYNCHRONOUS_IO tend to pivot
> "efficiency" [looking at the comments], but in ZRAM's case the whole
> reason to use SYNC IO is a race condition and user-after-free that
               ^ASYNC IO

	-ss

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03  2:39   ` Sergey Senozhatsky
  2018-08-03  2:52     ` Sergey Senozhatsky
@ 2018-08-03  3:00     ` Minchan Kim
  2018-08-03  4:13       ` Sergey Senozhatsky
  1 sibling, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2018-08-03  3:00 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML, Tino Lehnig, stable, Jens Axboe

Hi Sergey,

On Fri, Aug 03, 2018 at 11:39:29AM +0900, Sergey Senozhatsky wrote:
> On (08/02/18 14:13), Andrew Morton wrote:
> [..]
> > That changelog is rather hard to follow.  Please review my edits:
> > 
> > : If zram supports writeback feature, it's no longer a BD_CAP_SYNCHRONOUS_IO
> 							^BDI_CAP_SYNCHRONOUS_IO
> 
> [..]
> 
> > A reader looking at this would wonder "why the heck are we doing that".
> > Adding a code comment would help them.
> 
> The interesting thing here is that include/linux/backing-dev.h
> BDI_CAP_SYNCHRONOUS_IO comment says
> 
> 	"Device is so fast that asynchronous IO would be inefficient."
> 
> Which is not the reason why BDI_CAP_SYNCHRONOUS_IO is used by ZRAM.
> Probably, the comment needs to be updated as well.

I couldn't catch your point. Could you clarify a little bit more?
What do you want to correct for the comment?

> 
> Both SWP_SYNCHRONOUS_IO and BDI_CAP_SYNCHRONOUS_IO tend to pivot
> "efficiency" [looking at the comments], but in ZRAM's case the whole
> reason to use SYNC IO is a race condition and user-after-free that
> follows.

Actually, it's not whole reason. As I wrote down, without it, swap_readpage
waits the IO completion for a long time by blk_poll so it causes system
sluggish problem when device is slow(e.g., zram with backing device).

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03  3:00     ` Minchan Kim
@ 2018-08-03  4:13       ` Sergey Senozhatsky
  2018-08-03  4:51         ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Sergey Senozhatsky @ 2018-08-03  4:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, LKML, Tino Lehnig, stable, Jens Axboe

Hi Minchan,

On (08/03/18 12:00), Minchan Kim wrote:
> > 	"Device is so fast that asynchronous IO would be inefficient."
> > 
> > Which is not the reason why BDI_CAP_SYNCHRONOUS_IO is used by ZRAM.
> > Probably, the comment needs to be updated as well.
> 
> I couldn't catch your point. Could you clarify a little bit more?
> What do you want to correct for the comment?
>
> > Both SWP_SYNCHRONOUS_IO and BDI_CAP_SYNCHRONOUS_IO tend to pivot
> > "efficiency" [looking at the comments], but in ZRAM's case the whole
> > reason to use SYNC IO is a race condition and user-after-free that
> > follows.
> 
> Actually, it's not whole reason. As I wrote down, without it, swap_readpage
> waits the IO completion for a long time by blk_poll so it causes system
> sluggish problem when device is slow(e.g., zram with backing device).

Sure, this is problem #1. But slow swap device probably doesn't do any
irreversible harm to the system. Unlike use-after-free, which does. Thus
use-after-free is a problem #0. BDI_CAP_SYNCHRONOUS_IO comment doesn't
mention problem #0; it talks about problem #1 only. So, nothing serious,
just wanted to point that out.

So we probably can make ZRAM always ASYNC when WB is enabled.


Or... maybe we can make swap out to be SYNC and perform WB in background.
In __zram_bvec_write() we can always write compressed object to zmalloc,
even the huge ones.
Things to note:
a) even when WB is enabled we still allocate huge classes
b) even when WB is enabled we still may use those huge classes (consider
   a case when backing devices is full)

So huge classes are still there and we still use them. So let's use
them?

For a huge object, after we stored it into zsmalloc, we can schedule a WB
work, which would:
a) write that particular object (page) to the backing device
b) mark entry as WB entry
c) remove object from zsmalloc, unlock necessary locks

So swap in should either see object in zsmalloc or on backing device.
How does this sound?

And reading from a backing device can always be SYNC. Can it?
Am I missing something?

	-ss

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03  4:13       ` Sergey Senozhatsky
@ 2018-08-03  4:51         ` Minchan Kim
  2018-08-03  5:23           ` Sergey Senozhatsky
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2018-08-03  4:51 UTC (permalink / raw)
  To: Sergey Senozhatsky; +Cc: Andrew Morton, LKML, Tino Lehnig, stable, Jens Axboe

On Fri, Aug 03, 2018 at 01:13:02PM +0900, Sergey Senozhatsky wrote:
> Hi Minchan,
> 
> On (08/03/18 12:00), Minchan Kim wrote:
> > > 	"Device is so fast that asynchronous IO would be inefficient."
> > > 
> > > Which is not the reason why BDI_CAP_SYNCHRONOUS_IO is used by ZRAM.
> > > Probably, the comment needs to be updated as well.
> > 
> > I couldn't catch your point. Could you clarify a little bit more?
> > What do you want to correct for the comment?
> >
> > > Both SWP_SYNCHRONOUS_IO and BDI_CAP_SYNCHRONOUS_IO tend to pivot
> > > "efficiency" [looking at the comments], but in ZRAM's case the whole
> > > reason to use SYNC IO is a race condition and user-after-free that
> > > follows.
> > 
> > Actually, it's not whole reason. As I wrote down, without it, swap_readpage
> > waits the IO completion for a long time by blk_poll so it causes system
> > sluggish problem when device is slow(e.g., zram with backing device).
> 
> Sure, this is problem #1. But slow swap device probably doesn't do any
> irreversible harm to the system. Unlike use-after-free, which does. Thus
> use-after-free is a problem #0. BDI_CAP_SYNCHRONOUS_IO comment doesn't
> mention problem #0; it talks about problem #1 only. So, nothing serious,
> just wanted to point that out.
> 
> So we probably can make ZRAM always ASYNC when WB is enabled.
> 
> 
> Or... maybe we can make swap out to be SYNC and perform WB in background.
> In __zram_bvec_write() we can always write compressed object to zmalloc,
> even the huge ones.
> Things to note:
> a) even when WB is enabled we still allocate huge classes
> b) even when WB is enabled we still may use those huge classes (consider
>    a case when backing devices is full)
> 
> So huge classes are still there and we still use them. So let's use
> them?
> 
> For a huge object, after we stored it into zsmalloc, we can schedule a WB
> work, which would:
> a) write that particular object (page) to the backing device
> b) mark entry as WB entry
> c) remove object from zsmalloc, unlock necessary locks
> 
> So swap in should either see object in zsmalloc or on backing device.
> How does this sound?
> 
> And reading from a backing device can always be SYNC. Can it?
> Am I missing something?

AFAIK, onging writeback page couldn't freed so it was not writeabck problem.

What I'm tryig to fix is read part.
If we use swapcache, it shouldn't be a problem either because swapcache
has a reference count and we should wait PG_lock release before the freeing
from the swapcache so there is no race condition.

However, by the skip swapcache logic, we don't have a refcount any longer.

I think we can hold a new refcount in zram driver itself. With that, we
could get both benefits from writeback feature and skip swapcache.

However, I decided, at this moment, going this simple way for
stable-material to solve #0 and #1 problems at the same time.

Thanks. 


> 
> 	-ss

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03  4:51         ` Minchan Kim
@ 2018-08-03  5:23           ` Sergey Senozhatsky
  2018-08-03  5:45             ` Sergey Senozhatsky
  0 siblings, 1 reply; 12+ messages in thread
From: Sergey Senozhatsky @ 2018-08-03  5:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, LKML, Tino Lehnig, stable, Jens Axboe

On (08/03/18 13:51), Minchan Kim wrote:
> 
> AFAIK, onging writeback page couldn't freed so it was not writeabck problem.
> 
> What I'm tryig to fix is read part.
> If we use swapcache, it shouldn't be a problem either because swapcache
> has a reference count and we should wait PG_lock release before the freeing
> from the swapcache so there is no race condition.

Hmm, any chance a WB device can be async on its own? We add a page
to a new bio and submit it to another async device driver. Then we
return back to the upper layer (swap), which can free a page before
the device picks up a request. Can this happen?

[..]
> However, I decided, at this moment, going this simple way for
> stable-material to solve #0 and #1 problems at the same time.

Agreed. Thanks.

	-ss

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03  5:23           ` Sergey Senozhatsky
@ 2018-08-03  5:45             ` Sergey Senozhatsky
  0 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2018-08-03  5:45 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Minchan Kim, Andrew Morton, LKML, Tino Lehnig, stable, Jens Axboe

On (08/03/18 14:23), Sergey Senozhatsky wrote:
> 
> Hmm, any chance a WB device can be async on its own? We add a page
> to a new bio and submit it to another async device driver. Then we
> return back to the upper layer (swap), which can free a page before
> the device picks up a request. Can this happen?

Nevermind these questions.
Sorry for the noise.

	-ss

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03  2:51   ` Minchan Kim
@ 2018-08-03 23:28     ` Andrew Morton
  2018-08-05 23:15       ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2018-08-03 23:28 UTC (permalink / raw)
  To: Minchan Kim; +Cc: LKML, Sergey Senozhatsky, Tino Lehnig, stable, Jens Axboe

On Fri, 3 Aug 2018 11:51:43 +0900 Minchan Kim <minchan@kernel.org> wrote:

> > Is it legitimate to be altering the bdi capabilities at this level?  Or
> > is this hacky?
> 
> Most of device's bdi capability seems to be static but there are few drivers
> which can change capability. For example, BDI_CAP_STABLE_WRITES
> 
> drivers/scsi/iscsi_tcp.c
> drivers/md/raid5.c
> 
> I believe it's driver itself advertisement stuff so I hope it's not hack.

The bdi is per-disk (per-queue).  So if zram changes the bdi for a
particular partition then it is accidentally mucking with the other
partitions as well, and it shouldn't do that.  At least, I think that's
how it works - it's been a while.  Jens, wake up ;)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature
  2018-08-03 23:28     ` Andrew Morton
@ 2018-08-05 23:15       ` Minchan Kim
  0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2018-08-05 23:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Tino Lehnig, stable, Jens Axboe

Hi Andrew,

On Fri, Aug 03, 2018 at 04:28:18PM -0700, Andrew Morton wrote:
> On Fri, 3 Aug 2018 11:51:43 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > > Is it legitimate to be altering the bdi capabilities at this level?  Or
> > > is this hacky?
> > 
> > Most of device's bdi capability seems to be static but there are few drivers
> > which can change capability. For example, BDI_CAP_STABLE_WRITES
> > 
> > drivers/scsi/iscsi_tcp.c
> > drivers/md/raid5.c
> > 
> > I believe it's driver itself advertisement stuff so I hope it's not hack.
> 
> The bdi is per-disk (per-queue).  So if zram changes the bdi for a
> particular partition then it is accidentally mucking with the other
> partitions as well, and it shouldn't do that.  At least, I think that's

Aha, I didn't know that. Thanks for the heads up.
However, I think zram doesn't support partitioning due to lacking of
probe feature via blk_register_region. ;-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-08-05 23:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-02  5:11 [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature Minchan Kim
2018-08-02 21:13 ` Andrew Morton
2018-08-03  2:39   ` Sergey Senozhatsky
2018-08-03  2:52     ` Sergey Senozhatsky
2018-08-03  3:00     ` Minchan Kim
2018-08-03  4:13       ` Sergey Senozhatsky
2018-08-03  4:51         ` Minchan Kim
2018-08-03  5:23           ` Sergey Senozhatsky
2018-08-03  5:45             ` Sergey Senozhatsky
2018-08-03  2:51   ` Minchan Kim
2018-08-03 23:28     ` Andrew Morton
2018-08-05 23:15       ` Minchan Kim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.