All of lore.kernel.org
 help / color / mirror / Atom feed
* LightNVM pblk: read/write of random kernel memory
@ 2017-06-27 23:30 Carl-Daniel Hailfinger
  2017-06-28 11:07 ` Javier Gonzalez
  0 siblings, 1 reply; 7+ messages in thread
From: Carl-Daniel Hailfinger @ 2017-06-27 23:30 UTC (permalink / raw)
  To: Javier González, Matias Bjørling, linux-block

Hi,

I'm currently having trouble with LightNVM pblk with kernel 4.12-rc7 on
Ubuntu 16.04.2 x86_64 in a Qemu VM using latest
https://github.com/OpenChannelSSD/qemu-nvme .

I'm creating a pblk device inside the VM with the following command:
nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
mylightnvmdevice -t pblk

Writing to the pblk device is only partially successful. I can see some
of the content which was written to the pblk device turn up in the
backing store file nvmebackingstore10G.nvme, but mostly the backing
store file contains random kernel memory from the VM. Reading back the
just written contents from the pblk device in the VM also yields random
kernel memory (or at least that's what I think that stuff is, i.e. lots
of strings present in various printk calls).

qemu command line follows:
qemu-nvme.git/x86_64-softmmu/qemu-system-x86_64 -m 4096 -machine
q35,accel=kvm -vga qxl -spice port=5901,addr=127.0.0.1,disable-ticketing
-net nic,model=e1000 -net user -hda
/storage2/vmimages/usefulimages/ubuntu-16.04.2-server-kernel412rc6.qcow2
-drive
file=/storage2/vmimages/nvmebackingstore10G.nvme,if=none,id=mynvme
-device
nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=1,lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,lbbtable=/storage2/vmimages/nvmebackingstore10G.bbtable,lmetadata=/storage2/vmimages/nvmebackingstore10G.meta,ldebug=1

The backing store file was created with
truncate -s 10G /storage2/vmimages/nvmebackingstore10G.nvme

This might either be a bug in the OpenChannelSSD qemu tree, or it might
be a kernel bug.

I also got warnings like the below:

[   40.842852] pblk: double invalidate
[   40.842872] ------------[ cut here ]------------
[   40.842879] WARNING: CPU: 0 PID: 1065 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:87 __pblk_map_invalidate+0x107/0x150 [pblk]
[   40.842880] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
e_core e1000
[   40.842906] CPU: 0 PID: 1065 Comm: nvme Not tainted 4.12.0-999-generic #201706212201
[   40.842906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   40.842907] task: ffff923077ed6a40 task.stack: ffffa3c2c0a64000
[   40.842910] RIP: 0010:__pblk_map_invalidate+0x107/0x150 [pblk]
[   40.842911] RSP: 0018:ffffa3c2c0a67ab0 EFLAGS: 00010282
[   40.842913] RAX: 0000000000000017 RBX: ffff9230777400a4 RCX: 0000000000000000
[   40.842914] RDX: 0000000000000000 RSI: ffff92307fc0dcc8 RDI: ffff92307fc0dcc8
[   40.842914] RBP: ffffa3c2c0a67ad0 R08: 0000000000000001 R09: 000000000000020b
[   40.842915] R10: 0000000000000000 R11: 000000000000020b R12: ffff923077740000
[   40.842916] R13: ffff923077eb3000 R14: 0000000000000008 R15: ffff923077eb3000
[   40.842918] FS:  00007f95e8dfa700(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
[   40.842919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.842919] CR2: 000056365f4d6ce8 CR3: 0000000171e93000 CR4: 00000000000006f0
[   40.842923] Call Trace:
[   40.842928]  pblk_map_invalidate+0x91/0xa0 [pblk]
[   40.842931]  pblk_update_map+0x62/0x270 [pblk]
[   40.842934]  pblk_recov_l2p+0x338/0x960 [pblk]
[   40.842937]  pblk_init+0x9ca/0xc70 [pblk]
[   40.842940]  ? pblk_init+0x9ca/0xc70 [pblk]
[   40.842943]  ? blkcg_init_queue+0x95/0xe0
[   40.842946]  nvm_ioctl_dev_create.isra.18+0x75e/0xa20
[   40.842948]  ? nvm_ioctl_dev_create.isra.18+0x75e/0xa20
[   40.842951]  nvm_ctl_ioctl+0x23c/0x4c0
[   40.842954]  ? __check_object_size+0xb3/0x190
[   40.842956]  do_vfs_ioctl+0xa3/0x600
[   40.842958]  ? putname+0x54/0x60
[   40.842960]  SyS_ioctl+0x79/0x90
[   40.842962]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[   40.842964] RIP: 0033:0x7f95e8911f07
[   40.842964] RSP: 002b:00007fff007cb048 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   40.842966] RAX: ffffffffffffffda RBX: 00007fff007cd8ca RCX: 00007f95e8911f07
[   40.842967] RDX: 00007fff007cb060 RSI: 0000000040804c22 RDI: 0000000000000003
[   40.842967] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[   40.842968] R10: 00007fff007cd900 R11: 0000000000000246 R12: 0000000000000000
[   40.842969] R13: 0000000000000000 R14: 0000000000000004 R15: 00007fff007cd8f9
[   40.842970] Code: 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3 80 3d 6e e3 00 00 00 75 15 48 c7 c7 4c cb 75 c0 c6 05 5e e3 00 00 01 e8 f0 f6 a5 ca <0f> ff 48 89 df c6 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3 
[   40.842994] ---[ end trace 3232b0c9b6044a9f ]---
[   41.018252] pblk: failed to recover L2P
[   41.057078] pblk: failed to recover L2P
[   41.093799] pblk: failed to recover L2P
[   41.130127] pblk: failed to recover L2P
[   41.166476] pblk: failed to recover L2P
[   41.202939] pblk: failed to recover L2P
[   41.239758] pblk: failed to recover L2P
[   41.275851] pblk: failed to recover L2P
[   41.281475] pblk init: luns:1, lines:638, secs:2613248, buf entries:128
[   41.282849]  mylightnvmdevice: mylightnvmdevice1
[  303.375062] EXT4-fs (mylightnvmdevice1): mounted filesystem with ordered data mode. Opts: (null)
[  307.023723] ------------[ cut here ]------------
[  307.023743] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:1323 pblk_line_put+0xb7/0xc0 [pblk]
[  307.023745] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
e_core e1000
[  307.023861] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
[  307.023862] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[  307.023863] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
[  307.023868] RIP: 0010:pblk_line_put+0xb7/0xc0 [pblk]
[  307.023869] RSP: 0018:ffffa3c2c0c83ac8 EFLAGS: 00010293
[  307.023871] RAX: 0000000000000000 RBX: ffff923077eb3000 RCX: 0000000000000000
[  307.023873] RDX: 0000000000000001 RSI: ffff9230777405e0 RDI: ffff9230777405e4
[  307.023874] RBP: ffffa3c2c0c83ae8 R08: 8000000000000009 R09: ffff923078c44040
[  307.023875] R10: ffffa3c2c0a5fd30 R11: 00000000000003af R12: ffff9230777405e0
[  307.023876] R13: ffff9230777405e4 R14: ffff923077740540 R15: ffff9230777405e0
[  307.023878] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
[  307.023880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  307.023881] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
[  307.023886] Call Trace:
[  307.023893]  __pblk_rb_update_l2p+0xcb/0xd0 [pblk]
[  307.023898]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
[  307.023902]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
[  307.023906]  pblk_make_rq+0x71/0x120 [pblk]
[  307.023909]  ? pblk_make_rq+0x71/0x120 [pblk]
[  307.023914]  generic_make_request+0x11e/0x2d0
[  307.023916]  submit_bio+0x73/0x150
[  307.023918]  ? submit_bio+0x73/0x150
[  307.023921]  next_bio+0x38/0x40
[  307.023923]  __blkdev_issue_zeroout+0x159/0x200
[  307.023926]  blkdev_issue_zeroout+0x6c/0xc0
[  307.023931]  ext4_init_inode_table+0x18a/0x380
[  307.023934]  ? ext4_init_inode_table+0x18a/0x380
[  307.023938]  ext4_lazyinit_thread+0x296/0x3b0
[  307.023942]  kthread+0x109/0x140
[  307.023945]  ? init_once+0x80/0x80
[  307.023948]  ? kthread_create_on_node+0x70/0x70
[  307.023951]  ret_from_fork+0x25/0x30
[  307.023952] Code: 49 89 44 24 88 48 89 10 83 43 64 01 c6 07 00 0f 1f 40 00 48 8d bb 68 02 00 00 4c 89 f6 e8 a2 7c 00 00 5b 41 5c 41 5d 41 5e 5d c3 <0f> ff e9 7a ff ff ff 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 
[  307.023986] ---[ end trace 3232b0c9b6044aa0 ]---
[  307.023989] refcount_t: underflow; use-after-free.
[  307.024001] ------------[ cut here ]------------
[  307.024006] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/lib/refcount.c:184 refcount_sub_and_test+0x45/0x50
[  307.024006] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvme_core e1000
[  307.024033] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
[  307.024034] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[  307.024035] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
[  307.024037] RIP: 0010:refcount_sub_and_test+0x45/0x50
[  307.024038] RSP: 0018:ffffa3c2c0c83ad8 EFLAGS: 00010286
[  307.024040] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000006
[  307.024041] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff92307fc0dcc0
[  307.024043] RBP: ffffa3c2c0c83ad8 R08: 0000000000000001 R09: 0000000000000264
[  307.024044] R10: ffffa3c2c0c83a50 R11: 0000000000000264 R12: ffff923077eb3220
[  307.024045] R13: ffffa3c2c08f1048 R14: ffff923077eb3208 R15: ffff9230777405e0
[  307.024047] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
[  307.024048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  307.024049] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
[  307.024052] Call Trace:
[  307.024055]  refcount_dec_and_test+0x11/0x20
[  307.024059]  __pblk_rb_update_l2p+0x7a/0xd0 [pblk]
[  307.024063]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
[  307.024067]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
[  307.024071]  pblk_make_rq+0x71/0x120 [pblk]
[  307.024074]  ? pblk_make_rq+0x71/0x120 [pblk]
[  307.024076]  generic_make_request+0x11e/0x2d0
[  307.024079]  submit_bio+0x73/0x150
[  307.024081]  ? submit_bio+0x73/0x150
[  307.024083]  next_bio+0x38/0x40
[  307.024085]  __blkdev_issue_zeroout+0x159/0x200
[  307.024087]  blkdev_issue_zeroout+0x6c/0xc0
[  307.024091]  ext4_init_inode_table+0x18a/0x380
[  307.024094]  ? ext4_init_inode_table+0x18a/0x380
[  307.024097]  ext4_lazyinit_thread+0x296/0x3b0
[  307.024100]  kthread+0x109/0x140
[  307.024103]  ? init_once+0x80/0x80
[  307.024105]  ? kthread_create_on_node+0x70/0x70
[  307.024107]  ret_from_fork+0x25/0x30
[  307.024109] Code: 75 e6 85 d2 0f 94 c0 c3 31 c0 c3 80 3d 19 55 b2 00 00 75 f4 55 48 c7 c7 88 cd ce 8b c6 05 08 55 b2 00 01 48 89 e5 e8 22 4f d5 ff <0f> ff 31 c0 5d c3 0f 1f 44 00 00 55 48 89 fe bf 01 00 00 00 48 
[  307.024142] ---[ end trace 3232b0c9b6044aa1 ]---
[  484.430925] INFO: task jbd2/mylightnvm:1160 blocked for more than 120 seconds.
[  484.432106]       Tainted: G        W       4.12.0-999-generic #201706212201
[  484.432340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.432567] jbd2/mylightnvm D    0  1160      2 0x00000000
[  484.432570] Call Trace:
[  484.432606]  __schedule+0x3c9/0x850
[  484.432616]  ? __enqueue_entity+0x6c/0x70
[  484.432619]  ? wake_atomic_t_function+0x60/0x60
[  484.432621]  schedule+0x36/0x80
[  484.432628]  jbd2_journal_commit_transaction+0x25e/0x1710
[  484.432630]  ? update_load_avg+0x429/0x5a0


AFAICS pblk should not cause such problems.

Reproducing with qemu is easy. One or two reboots of the VM might be needed.

Side note: I have to recreate the pblk device after every reboot because
it is not recognized automatically.
nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
mylightnvmdevice -t pblk

Regards,
Carl-Daniel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: LightNVM pblk: read/write of random kernel memory
  2017-06-27 23:30 LightNVM pblk: read/write of random kernel memory Carl-Daniel Hailfinger
@ 2017-06-28 11:07 ` Javier Gonzalez
  2017-06-28 14:33   ` Carl-Daniel Hailfinger
  0 siblings, 1 reply; 7+ messages in thread
From: Javier Gonzalez @ 2017-06-28 11:07 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger; +Cc: Matias Bjørling, linux-block

[-- Attachment #1: Type: text/plain, Size: 14748 bytes --]

Hi Carl-Daniel,

I'll take the question here, but please use our github [1] to report
errors and ask questions instead (including this thread). No need to
spam the rest of the linux-block mailing list for LightNVM specific
matters - unless of course, you want to discuss specific parts of the
code.

[1] https://github.com/OpenChannelSSD

> On 28 Jun 2017, at 01.30, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
> 
> Hi,
> 
> I'm currently having trouble with LightNVM pblk with kernel 4.12-rc7 on
> Ubuntu 16.04.2 x86_64 in a Qemu VM using latest
> https://github.com/OpenChannelSSD/qemu-nvme .
> 
> I'm creating a pblk device inside the VM with the following command:
> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
> mylightnvmdevice -t pblk

First of all, can you create qemu configuration and pblk instance over
more than 1 LUN? There are some assumptions in QEMU on how we do the
sparse to linear address space transformation that assume several LUNs.
It is possible to change this, if you have a particular use case for
using a single LUN.

Also, for this type of testing add the option -f to the create command.
It will create the instance faster (skipping recovery).

nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
mylightnvmdevice -t pblk -f

> 
> Writing to the pblk device is only partially successful. I can see some
> of the content which was written to the pblk device turn up in the
> backing store file nvmebackingstore10G.nvme, but mostly the backing
> store file contains random kernel memory from the VM. Reading back the
> just written contents from the pblk device in the VM also yields random
> kernel memory (or at least that's what I think that stuff is, i.e. lots
> of strings present in various printk calls).
> 

Can you better define partially succesful? Which workload are you
running on top of the block device exposed by the pblk instance? Is it
failing in any way? The backend file is a binary, so reading it will not
giv you much information about the data being stored.

> qemu command line follows:
> qemu-nvme.git/x86_64-softmmu/qemu-system-x86_64 -m 4096 -machine
> q35,accel=kvm -vga qxl -spice port=5901,addr=127.0.0.1,disable-ticketing
> -net nic,model=e1000 -net user -hda
> /storage2/vmimages/usefulimages/ubuntu-16.04.2-server-kernel412rc6.qcow2
> -drive
> file=/storage2/vmimages/nvmebackingstore10G.nvme,if=none,id=mynvme
> -device
> nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=1,

As mentioned above, try several with several LUNs.

> lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,lbbtable=/storage2/vmimages/nvmebackingstore10G.bbtable,lmetadata=/storage2/vmimages/nvmebackingstore10G.meta,ldebug=1
> 
> The backing store file was created with
> truncate -s 10G /storage2/vmimages/nvmebackingstore10G.nvme
> 
> This might either be a bug in the OpenChannelSSD qemu tree, or it might
> be a kernel bug.
> 
> I also got warnings like the below:

In the 4.12 patches for pblk we do not have an error state machine. This
is, when writes fail on the device (on qemu in this case), we did not
communicate this to the application. This bad error handling results in
unexpected side-errors like the one you are experiencing. On the patches
for 4.13, we have implemented the error state machine, so this type of
errors should be better handled.

You can pick up the code from out github (linux.git - branch:
pblk.for-4.13) or take it directly form Jens' for-4.13/core

> 
> [   40.842852] pblk: double invalidate
> [   40.842872] ------------[ cut here ]------------
> [   40.842879] WARNING: CPU: 0 PID: 1065 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:87 __pblk_map_invalidate+0x107/0x150 [pblk]
> [   40.842880] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
> aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
> e_core e1000
> [   40.842906] CPU: 0 PID: 1065 Comm: nvme Not tainted 4.12.0-999-generic #201706212201
> [   40.842906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [   40.842907] task: ffff923077ed6a40 task.stack: ffffa3c2c0a64000
> [   40.842910] RIP: 0010:__pblk_map_invalidate+0x107/0x150 [pblk]
> [   40.842911] RSP: 0018:ffffa3c2c0a67ab0 EFLAGS: 00010282
> [   40.842913] RAX: 0000000000000017 RBX: ffff9230777400a4 RCX: 0000000000000000
> [   40.842914] RDX: 0000000000000000 RSI: ffff92307fc0dcc8 RDI: ffff92307fc0dcc8
> [   40.842914] RBP: ffffa3c2c0a67ad0 R08: 0000000000000001 R09: 000000000000020b
> [   40.842915] R10: 0000000000000000 R11: 000000000000020b R12: ffff923077740000
> [   40.842916] R13: ffff923077eb3000 R14: 0000000000000008 R15: ffff923077eb3000
> [   40.842918] FS:  00007f95e8dfa700(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
> [   40.842919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   40.842919] CR2: 000056365f4d6ce8 CR3: 0000000171e93000 CR4: 00000000000006f0
> [   40.842923] Call Trace:
> [   40.842928]  pblk_map_invalidate+0x91/0xa0 [pblk]
> [   40.842931]  pblk_update_map+0x62/0x270 [pblk]
> [   40.842934]  pblk_recov_l2p+0x338/0x960 [pblk]
> [   40.842937]  pblk_init+0x9ca/0xc70 [pblk]
> [   40.842940]  ? pblk_init+0x9ca/0xc70 [pblk]
> [   40.842943]  ? blkcg_init_queue+0x95/0xe0
> [   40.842946]  nvm_ioctl_dev_create.isra.18+0x75e/0xa20
> [   40.842948]  ? nvm_ioctl_dev_create.isra.18+0x75e/0xa20
> [   40.842951]  nvm_ctl_ioctl+0x23c/0x4c0
> [   40.842954]  ? __check_object_size+0xb3/0x190
> [   40.842956]  do_vfs_ioctl+0xa3/0x600
> [   40.842958]  ? putname+0x54/0x60
> [   40.842960]  SyS_ioctl+0x79/0x90
> [   40.842962]  entry_SYSCALL_64_fastpath+0x1e/0xa9
> [   40.842964] RIP: 0033:0x7f95e8911f07
> [   40.842964] RSP: 002b:00007fff007cb048 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [   40.842966] RAX: ffffffffffffffda RBX: 00007fff007cd8ca RCX: 00007f95e8911f07
> [   40.842967] RDX: 00007fff007cb060 RSI: 0000000040804c22 RDI: 0000000000000003
> [   40.842967] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [   40.842968] R10: 00007fff007cd900 R11: 0000000000000246 R12: 0000000000000000
> [   40.842969] R13: 0000000000000000 R14: 0000000000000004 R15: 00007fff007cd8f9
> [   40.842970] Code: 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3 80 3d 6e e3 00 00 00 75 15 48 c7 c7 4c cb 75 c0 c6 05 5e e3 00 00 01 e8 f0 f6 a5 ca <0f> ff 48 89 df c6 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3
> [   40.842994] ---[ end trace 3232b0c9b6044a9f ]---
> [   41.018252] pblk: failed to recover L2P
> [   41.057078] pblk: failed to recover L2P
> [   41.093799] pblk: failed to recover L2P
> [   41.130127] pblk: failed to recover L2P
> [   41.166476] pblk: failed to recover L2P
> [   41.202939] pblk: failed to recover L2P
> [   41.239758] pblk: failed to recover L2P
> [   41.275851] pblk: failed to recover L2P
> [   41.281475] pblk init: luns:1, lines:638, secs:2613248, buf entries:128
> [   41.282849]  mylightnvmdevice: mylightnvmdevice1
> [  303.375062] EXT4-fs (mylightnvmdevice1): mounted filesystem with ordered data mode. Opts: (null)
> [  307.023723] ------------[ cut here ]------------
> [  307.023743] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:1323 pblk_line_put+0xb7/0xc0 [pblk]
> [  307.023745] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
> aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
> e_core e1000
> [  307.023861] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
> [  307.023862] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [  307.023863] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
> [  307.023868] RIP: 0010:pblk_line_put+0xb7/0xc0 [pblk]
> [  307.023869] RSP: 0018:ffffa3c2c0c83ac8 EFLAGS: 00010293
> [  307.023871] RAX: 0000000000000000 RBX: ffff923077eb3000 RCX: 0000000000000000
> [  307.023873] RDX: 0000000000000001 RSI: ffff9230777405e0 RDI: ffff9230777405e4
> [  307.023874] RBP: ffffa3c2c0c83ae8 R08: 8000000000000009 R09: ffff923078c44040
> [  307.023875] R10: ffffa3c2c0a5fd30 R11: 00000000000003af R12: ffff9230777405e0
> [  307.023876] R13: ffff9230777405e4 R14: ffff923077740540 R15: ffff9230777405e0
> [  307.023878] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
> [  307.023880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  307.023881] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
> [  307.023886] Call Trace:
> [  307.023893]  __pblk_rb_update_l2p+0xcb/0xd0 [pblk]
> [  307.023898]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
> [  307.023902]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
> [  307.023906]  pblk_make_rq+0x71/0x120 [pblk]
> [  307.023909]  ? pblk_make_rq+0x71/0x120 [pblk]
> [  307.023914]  generic_make_request+0x11e/0x2d0
> [  307.023916]  submit_bio+0x73/0x150
> [  307.023918]  ? submit_bio+0x73/0x150
> [  307.023921]  next_bio+0x38/0x40
> [  307.023923]  __blkdev_issue_zeroout+0x159/0x200
> [  307.023926]  blkdev_issue_zeroout+0x6c/0xc0
> [  307.023931]  ext4_init_inode_table+0x18a/0x380
> [  307.023934]  ? ext4_init_inode_table+0x18a/0x380
> [  307.023938]  ext4_lazyinit_thread+0x296/0x3b0
> [  307.023942]  kthread+0x109/0x140
> [  307.023945]  ? init_once+0x80/0x80
> [  307.023948]  ? kthread_create_on_node+0x70/0x70
> [  307.023951]  ret_from_fork+0x25/0x30
> [  307.023952] Code: 49 89 44 24 88 48 89 10 83 43 64 01 c6 07 00 0f 1f 40 00 48 8d bb 68 02 00 00 4c 89 f6 e8 a2 7c 00 00 5b 41 5c 41 5d 41 5e 5d c3 <0f> ff e9 7a ff ff ff 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41
> [  307.023986] ---[ end trace 3232b0c9b6044aa0 ]---
> [  307.023989] refcount_t: underflow; use-after-free.
> [  307.024001] ------------[ cut here ]------------
> [  307.024006] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/lib/refcount.c:184 refcount_sub_and_test+0x45/0x50
> [  307.024006] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvme_core e1000
> [  307.024033] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
> [  307.024034] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [  307.024035] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
> [  307.024037] RIP: 0010:refcount_sub_and_test+0x45/0x50
> [  307.024038] RSP: 0018:ffffa3c2c0c83ad8 EFLAGS: 00010286
> [  307.024040] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000006
> [  307.024041] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff92307fc0dcc0
> [  307.024043] RBP: ffffa3c2c0c83ad8 R08: 0000000000000001 R09: 0000000000000264
> [  307.024044] R10: ffffa3c2c0c83a50 R11: 0000000000000264 R12: ffff923077eb3220
> [  307.024045] R13: ffffa3c2c08f1048 R14: ffff923077eb3208 R15: ffff9230777405e0
> [  307.024047] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
> [  307.024048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  307.024049] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
> [  307.024052] Call Trace:
> [  307.024055]  refcount_dec_and_test+0x11/0x20
> [  307.024059]  __pblk_rb_update_l2p+0x7a/0xd0 [pblk]
> [  307.024063]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
> [  307.024067]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
> [  307.024071]  pblk_make_rq+0x71/0x120 [pblk]
> [  307.024074]  ? pblk_make_rq+0x71/0x120 [pblk]
> [  307.024076]  generic_make_request+0x11e/0x2d0
> [  307.024079]  submit_bio+0x73/0x150
> [  307.024081]  ? submit_bio+0x73/0x150
> [  307.024083]  next_bio+0x38/0x40
> [  307.024085]  __blkdev_issue_zeroout+0x159/0x200
> [  307.024087]  blkdev_issue_zeroout+0x6c/0xc0
> [  307.024091]  ext4_init_inode_table+0x18a/0x380
> [  307.024094]  ? ext4_init_inode_table+0x18a/0x380
> [  307.024097]  ext4_lazyinit_thread+0x296/0x3b0
> [  307.024100]  kthread+0x109/0x140
> [  307.024103]  ? init_once+0x80/0x80
> [  307.024105]  ? kthread_create_on_node+0x70/0x70
> [  307.024107]  ret_from_fork+0x25/0x30
> [  307.024109] Code: 75 e6 85 d2 0f 94 c0 c3 31 c0 c3 80 3d 19 55 b2 00 00 75 f4 55 48 c7 c7 88 cd ce 8b c6 05 08 55 b2 00 01 48 89 e5 e8 22 4f d5 ff <0f> ff 31 c0 5d c3 0f 1f 44 00 00 55 48 89 fe bf 01 00 00 00 48
> [  307.024142] ---[ end trace 3232b0c9b6044aa1 ]---
> [  484.430925] INFO: task jbd2/mylightnvm:1160 blocked for more than 120 seconds.
> [  484.432106]       Tainted: G        W       4.12.0-999-generic #201706212201
> [  484.432340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  484.432567] jbd2/mylightnvm D    0  1160      2 0x00000000
> [  484.432570] Call Trace:
> [  484.432606]  __schedule+0x3c9/0x850
> [  484.432616]  ? __enqueue_entity+0x6c/0x70
> [  484.432619]  ? wake_atomic_t_function+0x60/0x60
> [  484.432621]  schedule+0x36/0x80
> [  484.432628]  jbd2_journal_commit_transaction+0x25e/0x1710
> [  484.432630]  ? update_load_avg+0x429/0x5a0
> 
> 
> AFAICS pblk should not cause such problems.
> 
> Reproducing with qemu is easy. One or two reboots of the VM might be needed.
> 
> Side note: I have to recreate the pblk device after every reboot because
> it is not recognized automatically.
> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
> mylightnvmdevice -t pblk

You need to re-create the instance each time. Think of it as mounting a
file system.

Javier

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: LightNVM pblk: read/write of random kernel memory
  2017-06-28 11:07 ` Javier Gonzalez
@ 2017-06-28 14:33   ` Carl-Daniel Hailfinger
  2017-06-28 14:58     ` Javier Gonzalez
  0 siblings, 1 reply; 7+ messages in thread
From: Carl-Daniel Hailfinger @ 2017-06-28 14:33 UTC (permalink / raw)
  To: Javier Gonzalez; +Cc: Matias Bjørling, linux-block

Hi Javier,

thanks for the pointer to the github reporting page.
I'll answer your questions here (to make then indexable by search
engines in case someone else stumbles upon this) and link to newly
created github issues for the various problems I encountered.

On 28.06.2017 13:07, Javier Gonzalez wrote:
> I'll take the question here, but please use our github [1] to report
> errors and ask questions instead (including this thread). No need to
> spam the rest of the linux-block mailing list for LightNVM specific
> matters - unless of course, you want to discuss specific parts of the
> code.
>
> [1] https://github.com/OpenChannelSSD
>
>> On 28 Jun 2017, at 01.30, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>
>> I'm currently having trouble with LightNVM pblk with kernel 4.12-rc7 on
>> Ubuntu 16.04.2 x86_64 in a Qemu VM using latest
>> https://github.com/OpenChannelSSD/qemu-nvme .
>>
>> I'm creating a pblk device inside the VM with the following command:
>> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
>> mylightnvmdevice -t pblk
> First of all, can you create qemu configuration and pblk instance over
> more than 1 LUN? There are some assumptions in QEMU on how we do the
> sparse to linear address space transformation that assume several LUNs.

Ah, sure, can try. Please document this restriction so others won't step
into that particular trap.
Filed as https://github.com/OpenChannelSSD/linux/issues/29


> It is possible to change this, if you have a particular use case for
> using a single LUN.

I just thought this would make things simpler, but that was an erroneous
conclusion on my side.


> Also, for this type of testing add the option -f to the create command.
> It will create the instance faster (skipping recovery).
>
> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
> mylightnvmdevice -t pblk -f
>
>> Writing to the pblk device is only partially successful. I can see some
>> of the content which was written to the pblk device turn up in the
>> backing store file nvmebackingstore10G.nvme, but mostly the backing
>> store file contains random kernel memory from the VM. Reading back the
>> just written contents from the pblk device in the VM also yields random
>> kernel memory (or at least that's what I think that stuff is, i.e. lots
>> of strings present in various printk calls).
>>
> Can you better define partially succesful?

Some of the contents written to the pblk device inside the vm end up
being written to the backing store, and some regions of the backing
store contain random kernel memory of the vm after a write. I am unable
to detect a pattern there, but random kernel memory should never be
written to disk in any case.


> Which workload are you
> running on top of the block device exposed by the pblk instance? Is it
> failing in any way?

I run fdisk on the instance to create a single partition with maximum
size, then
mkfs.ext4 /dev/mylightnvmdevice1
mount /dev/mylightnvmdevice1 /mnt
yes yes|head -n 4096 >/mnt/yes
umount /mnt

Sometimes this results in an immediate hang during writing /mnt/yes,
sometimes it hangs on umount.
Filed as https://github.com/OpenChannelSSD/linux/issues/28


Inspecting the backing store sometimes yields the expected amount of
data written, sometimes parts of the backing store contain random vm
kernel memory. This random kernel memory can also be read from inside
the vm by hexdumping /dev/mylightnvmdevice .
Filed as https://github.com/OpenChannelSSD/linux/issues/30


> The backend file is a binary, so reading it will not
> giv you much information about the data being stored.

The backend file is binary, but a hexdump shows that the contents of any
written sector ends up there without changes. It's just the location of
the data inside the backing file which is not straightforward (due to
the mapping part of the pblk FTL).


>> qemu command line follows:
>> qemu-nvme.git/x86_64-softmmu/qemu-system-x86_64 -m 4096 -machine
>> q35,accel=kvm -vga qxl -spice port=5901,addr=127.0.0.1,disable-ticketing
>> -net nic,model=e1000 -net user -hda
>> /storage2/vmimages/usefulimages/ubuntu-16.04.2-server-kernel412rc6.qcow2
>> -drive
>> file=/storage2/vmimages/nvmebackingstore10G.nvme,if=none,id=mynvme
>> -device
>> nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=1,
> As mentioned above, try several with several LUNs.
>
>> lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,lbbtable=/storage2/vmimages/nvmebackingstore10G.bbtable,lmetadata=/storage2/vmimages/nvmebackingstore10G.meta,ldebug=1
>>
>> The backing store file was created with
>> truncate -s 10G /storage2/vmimages/nvmebackingstore10G.nvme
>>
>> This might either be a bug in the OpenChannelSSD qemu tree, or it might
>> be a kernel bug.
>>
>> I also got warnings like the below:
> In the 4.12 patches for pblk we do not have an error state machine. This
> is, when writes fail on the device (on qemu in this case), we did not
> communicate this to the application. This bad error handling results in
> unexpected side-errors like the one you are experiencing. On the patches
> for 4.13, we have implemented the error state machine, so this type of
> errors should be better handled.

Oh. Shouldn't a minimal version of those patches get merged into 4.12
(or 4.12-stable once 4.12 is released) to avoid releasing a kernel with
a data corruption bug?


> You can pick up the code from out github (linux.git - branch:
> pblk.for-4.13) or take it directly form Jens' for-4.13/core

Thanks. A full kernel compile will take some time, though. Do you happen
to have a Ubuntu-compatible kernel .deb for the new code?

 
>> [   40.842852] pblk: double invalidate
>> [   40.842872] ------------[ cut here ]------------
>> [   40.842879] WARNING: CPU: 0 PID: 1065 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:87 __pblk_map_invalidate+0x107/0x150 [pblk]
>> [   40.842880] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
>> aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
>> e_core e1000
>> [   40.842906] CPU: 0 PID: 1065 Comm: nvme Not tainted 4.12.0-999-generic #201706212201
>> [   40.842906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>> [   40.842907] task: ffff923077ed6a40 task.stack: ffffa3c2c0a64000
>> [   40.842910] RIP: 0010:__pblk_map_invalidate+0x107/0x150 [pblk]
>> [   40.842911] RSP: 0018:ffffa3c2c0a67ab0 EFLAGS: 00010282
>> [   40.842913] RAX: 0000000000000017 RBX: ffff9230777400a4 RCX: 0000000000000000
>> [   40.842914] RDX: 0000000000000000 RSI: ffff92307fc0dcc8 RDI: ffff92307fc0dcc8
>> [   40.842914] RBP: ffffa3c2c0a67ad0 R08: 0000000000000001 R09: 000000000000020b
>> [   40.842915] R10: 0000000000000000 R11: 000000000000020b R12: ffff923077740000
>> [   40.842916] R13: ffff923077eb3000 R14: 0000000000000008 R15: ffff923077eb3000
>> [   40.842918] FS:  00007f95e8dfa700(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
>> [   40.842919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   40.842919] CR2: 000056365f4d6ce8 CR3: 0000000171e93000 CR4: 00000000000006f0
>> [   40.842923] Call Trace:
>> [   40.842928]  pblk_map_invalidate+0x91/0xa0 [pblk]
>> [   40.842931]  pblk_update_map+0x62/0x270 [pblk]
>> [   40.842934]  pblk_recov_l2p+0x338/0x960 [pblk]
>> [   40.842937]  pblk_init+0x9ca/0xc70 [pblk]
>> [   40.842940]  ? pblk_init+0x9ca/0xc70 [pblk]
>> [   40.842943]  ? blkcg_init_queue+0x95/0xe0
>> [   40.842946]  nvm_ioctl_dev_create.isra.18+0x75e/0xa20
>> [   40.842948]  ? nvm_ioctl_dev_create.isra.18+0x75e/0xa20
>> [   40.842951]  nvm_ctl_ioctl+0x23c/0x4c0
>> [   40.842954]  ? __check_object_size+0xb3/0x190
>> [   40.842956]  do_vfs_ioctl+0xa3/0x600
>> [   40.842958]  ? putname+0x54/0x60
>> [   40.842960]  SyS_ioctl+0x79/0x90
>> [   40.842962]  entry_SYSCALL_64_fastpath+0x1e/0xa9
>> [   40.842964] RIP: 0033:0x7f95e8911f07
>> [   40.842964] RSP: 002b:00007fff007cb048 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [   40.842966] RAX: ffffffffffffffda RBX: 00007fff007cd8ca RCX: 00007f95e8911f07
>> [   40.842967] RDX: 00007fff007cb060 RSI: 0000000040804c22 RDI: 0000000000000003
>> [   40.842967] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> [   40.842968] R10: 00007fff007cd900 R11: 0000000000000246 R12: 0000000000000000
>> [   40.842969] R13: 0000000000000000 R14: 0000000000000004 R15: 00007fff007cd8f9
>> [   40.842970] Code: 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3 80 3d 6e e3 00 00 00 75 15 48 c7 c7 4c cb 75 c0 c6 05 5e e3 00 00 01 e8 f0 f6 a5 ca <0f> ff 48 89 df c6 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3
>> [   40.842994] ---[ end trace 3232b0c9b6044a9f ]---
>> [   41.018252] pblk: failed to recover L2P
>> [   41.057078] pblk: failed to recover L2P
>> [   41.093799] pblk: failed to recover L2P
>> [   41.130127] pblk: failed to recover L2P
>> [   41.166476] pblk: failed to recover L2P
>> [   41.202939] pblk: failed to recover L2P
>> [   41.239758] pblk: failed to recover L2P
>> [   41.275851] pblk: failed to recover L2P
>> [   41.281475] pblk init: luns:1, lines:638, secs:2613248, buf entries:128
>> [   41.282849]  mylightnvmdevice: mylightnvmdevice1
>> [  303.375062] EXT4-fs (mylightnvmdevice1): mounted filesystem with ordered data mode. Opts: (null)
>> [  307.023723] ------------[ cut here ]------------
>> [  307.023743] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:1323 pblk_line_put+0xb7/0xc0 [pblk]
>> [  307.023745] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
>> aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
>> e_core e1000
>> [  307.023861] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
>> [  307.023862] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>> [  307.023863] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
>> [  307.023868] RIP: 0010:pblk_line_put+0xb7/0xc0 [pblk]
>> [  307.023869] RSP: 0018:ffffa3c2c0c83ac8 EFLAGS: 00010293
>> [  307.023871] RAX: 0000000000000000 RBX: ffff923077eb3000 RCX: 0000000000000000
>> [  307.023873] RDX: 0000000000000001 RSI: ffff9230777405e0 RDI: ffff9230777405e4
>> [  307.023874] RBP: ffffa3c2c0c83ae8 R08: 8000000000000009 R09: ffff923078c44040
>> [  307.023875] R10: ffffa3c2c0a5fd30 R11: 00000000000003af R12: ffff9230777405e0
>> [  307.023876] R13: ffff9230777405e4 R14: ffff923077740540 R15: ffff9230777405e0
>> [  307.023878] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
>> [  307.023880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  307.023881] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
>> [  307.023886] Call Trace:
>> [  307.023893]  __pblk_rb_update_l2p+0xcb/0xd0 [pblk]
>> [  307.023898]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
>> [  307.023902]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
>> [  307.023906]  pblk_make_rq+0x71/0x120 [pblk]
>> [  307.023909]  ? pblk_make_rq+0x71/0x120 [pblk]
>> [  307.023914]  generic_make_request+0x11e/0x2d0
>> [  307.023916]  submit_bio+0x73/0x150
>> [  307.023918]  ? submit_bio+0x73/0x150
>> [  307.023921]  next_bio+0x38/0x40
>> [  307.023923]  __blkdev_issue_zeroout+0x159/0x200
>> [  307.023926]  blkdev_issue_zeroout+0x6c/0xc0
>> [  307.023931]  ext4_init_inode_table+0x18a/0x380
>> [  307.023934]  ? ext4_init_inode_table+0x18a/0x380
>> [  307.023938]  ext4_lazyinit_thread+0x296/0x3b0
>> [  307.023942]  kthread+0x109/0x140
>> [  307.023945]  ? init_once+0x80/0x80
>> [  307.023948]  ? kthread_create_on_node+0x70/0x70
>> [  307.023951]  ret_from_fork+0x25/0x30
>> [  307.023952] Code: 49 89 44 24 88 48 89 10 83 43 64 01 c6 07 00 0f 1f 40 00 48 8d bb 68 02 00 00 4c 89 f6 e8 a2 7c 00 00 5b 41 5c 41 5d 41 5e 5d c3 <0f> ff e9 7a ff ff ff 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41
>> [  307.023986] ---[ end trace 3232b0c9b6044aa0 ]---
>> [  307.023989] refcount_t: underflow; use-after-free.
>> [  307.024001] ------------[ cut here ]------------
>> [  307.024006] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/lib/refcount.c:184 refcount_sub_and_test+0x45/0x50
>> [  307.024006] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvme_core e1000
>> [  307.024033] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
>> [  307.024034] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>> [  307.024035] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
>> [  307.024037] RIP: 0010:refcount_sub_and_test+0x45/0x50
>> [  307.024038] RSP: 0018:ffffa3c2c0c83ad8 EFLAGS: 00010286
>> [  307.024040] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000006
>> [  307.024041] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff92307fc0dcc0
>> [  307.024043] RBP: ffffa3c2c0c83ad8 R08: 0000000000000001 R09: 0000000000000264
>> [  307.024044] R10: ffffa3c2c0c83a50 R11: 0000000000000264 R12: ffff923077eb3220
>> [  307.024045] R13: ffffa3c2c08f1048 R14: ffff923077eb3208 R15: ffff9230777405e0
>> [  307.024047] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
>> [  307.024048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  307.024049] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
>> [  307.024052] Call Trace:
>> [  307.024055]  refcount_dec_and_test+0x11/0x20
>> [  307.024059]  __pblk_rb_update_l2p+0x7a/0xd0 [pblk]
>> [  307.024063]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
>> [  307.024067]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
>> [  307.024071]  pblk_make_rq+0x71/0x120 [pblk]
>> [  307.024074]  ? pblk_make_rq+0x71/0x120 [pblk]
>> [  307.024076]  generic_make_request+0x11e/0x2d0
>> [  307.024079]  submit_bio+0x73/0x150
>> [  307.024081]  ? submit_bio+0x73/0x150
>> [  307.024083]  next_bio+0x38/0x40
>> [  307.024085]  __blkdev_issue_zeroout+0x159/0x200
>> [  307.024087]  blkdev_issue_zeroout+0x6c/0xc0
>> [  307.024091]  ext4_init_inode_table+0x18a/0x380
>> [  307.024094]  ? ext4_init_inode_table+0x18a/0x380
>> [  307.024097]  ext4_lazyinit_thread+0x296/0x3b0
>> [  307.024100]  kthread+0x109/0x140
>> [  307.024103]  ? init_once+0x80/0x80
>> [  307.024105]  ? kthread_create_on_node+0x70/0x70
>> [  307.024107]  ret_from_fork+0x25/0x30
>> [  307.024109] Code: 75 e6 85 d2 0f 94 c0 c3 31 c0 c3 80 3d 19 55 b2 00 00 75 f4 55 48 c7 c7 88 cd ce 8b c6 05 08 55 b2 00 01 48 89 e5 e8 22 4f d5 ff <0f> ff 31 c0 5d c3 0f 1f 44 00 00 55 48 89 fe bf 01 00 00 00 48
>> [  307.024142] ---[ end trace 3232b0c9b6044aa1 ]---
>> [  484.430925] INFO: task jbd2/mylightnvm:1160 blocked for more than 120 seconds.
>> [  484.432106]       Tainted: G        W       4.12.0-999-generic #201706212201
>> [  484.432340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  484.432567] jbd2/mylightnvm D    0  1160      2 0x00000000
>> [  484.432570] Call Trace:
>> [  484.432606]  __schedule+0x3c9/0x850
>> [  484.432616]  ? __enqueue_entity+0x6c/0x70
>> [  484.432619]  ? wake_atomic_t_function+0x60/0x60
>> [  484.432621]  schedule+0x36/0x80
>> [  484.432628]  jbd2_journal_commit_transaction+0x25e/0x1710
>> [  484.432630]  ? update_load_avg+0x429/0x5a0
>>
>>
>> AFAICS pblk should not cause such problems.
>>
>> Reproducing with qemu is easy. One or two reboots of the VM might be needed.
>>
>> Side note: I have to recreate the pblk device after every reboot because
>> it is not recognized automatically.
>> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
>> mylightnvmdevice -t pblk
> You need to re-create the instance each time. Think of it as mounting a
> file system.

Oh. I had assumed that this operation was more akin to creating a LVM
logical volume.
If I re-create the instance with differing values (e.g. different LUNs),
will the data still be there or will creation be refused to avoid data loss?
Filed as https://github.com/OpenChannelSSD/linux/issues/31

Regards,
Carl-Daniel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: LightNVM pblk: read/write of random kernel memory
  2017-06-28 14:33   ` Carl-Daniel Hailfinger
@ 2017-06-28 14:58     ` Javier Gonzalez
  2017-06-28 15:14       ` Carl-Daniel Hailfinger
                         ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Javier Gonzalez @ 2017-06-28 14:58 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger; +Cc: Matias Bjørling, linux-block

[-- Attachment #1: Type: text/plain, Size: 18424 bytes --]

> On 28 Jun 2017, at 16.33, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
> 
> Hi Javier,
> 
> thanks for the pointer to the github reporting page.
> I'll answer your questions here (to make then indexable by search
> engines in case someone else stumbles upon this) and link to newly
> created github issues for the various problems I encountered.
> 

Ok. I answered each issue directly on the github. A couple og things
inline though, for completion.

> On 28.06.2017 13:07, Javier Gonzalez wrote:
>> I'll take the question here, but please use our github [1] to report
>> errors and ask questions instead (including this thread). No need to
>> spam the rest of the linux-block mailing list for LightNVM specific
>> matters - unless of course, you want to discuss specific parts of the
>> code.
>> 
>> [1] https://github.com/OpenChannelSSD
>> 
>>> On 28 Jun 2017, at 01.30, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>> 
>>> I'm currently having trouble with LightNVM pblk with kernel 4.12-rc7 on
>>> Ubuntu 16.04.2 x86_64 in a Qemu VM using latest
>>> https://github.com/OpenChannelSSD/qemu-nvme .
>>> 
>>> I'm creating a pblk device inside the VM with the following command:
>>> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
>>> mylightnvmdevice -t pblk
>> First of all, can you create qemu configuration and pblk instance over
>> more than 1 LUN? There are some assumptions in QEMU on how we do the
>> sparse to linear address space transformation that assume several LUNs.
> 
> Ah, sure, can try. Please document this restriction so others won't step
> into that particular trap.
> Filed as https://github.com/OpenChannelSSD/linux/issues/29
> 
> 
>> It is possible to change this, if you have a particular use case for
>> using a single LUN.
> 
> I just thought this would make things simpler, but that was an erroneous
> conclusion on my side.
> 
> 
>> Also, for this type of testing add the option -f to the create command.
>> It will create the instance faster (skipping recovery).
>> 
>> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
>> mylightnvmdevice -t pblk -f
>> 
>>> Writing to the pblk device is only partially successful. I can see some
>>> of the content which was written to the pblk device turn up in the
>>> backing store file nvmebackingstore10G.nvme, but mostly the backing
>>> store file contains random kernel memory from the VM. Reading back the
>>> just written contents from the pblk device in the VM also yields random
>>> kernel memory (or at least that's what I think that stuff is, i.e. lots
>>> of strings present in various printk calls).
>> Can you better define partially succesful?
> 
> Some of the contents written to the pblk device inside the vm end up
> being written to the backing store, and some regions of the backing
> store contain random kernel memory of the vm after a write. I am unable
> to detect a pattern there, but random kernel memory should never be
> written to disk in any case.
> 
> 
>> Which workload are you
>> running on top of the block device exposed by the pblk instance? Is it
>> failing in any way?
> 
> I run fdisk on the instance to create a single partition with maximum
> size, then
> mkfs.ext4 /dev/mylightnvmdevice1
> mount /dev/mylightnvmdevice1 /mnt
> yes yes|head -n 4096 >/mnt/yes
> umount /mnt
> 
> Sometimes this results in an immediate hang during writing /mnt/yes,
> sometimes it hangs on umount.
> Filed as https://github.com/OpenChannelSSD/linux/issues/28
> 
> 
> Inspecting the backing store sometimes yields the expected amount of
> data written, sometimes parts of the backing store contain random vm
> kernel memory. This random kernel memory can also be read from inside
> the vm by hexdumping /dev/mylightnvmdevice .
> Filed as https://github.com/OpenChannelSSD/linux/issues/30
> 
> 
>> The backend file is a binary, so reading it will not
>> giv you much information about the data being stored.
> 
> The backend file is binary, but a hexdump shows that the contents of any
> written sector ends up there without changes. It's just the location of
> the data inside the backing file which is not straightforward (due to
> the mapping part of the pblk FTL).
> 
> 
>>> qemu command line follows:
>>> qemu-nvme.git/x86_64-softmmu/qemu-system-x86_64 -m 4096 -machine
>>> q35,accel=kvm -vga qxl -spice port=5901,addr=127.0.0.1,disable-ticketing
>>> -net nic,model=e1000 -net user -hda
>>> /storage2/vmimages/usefulimages/ubuntu-16.04.2-server-kernel412rc6.qcow2
>>> -drive
>>> file=/storage2/vmimages/nvmebackingstore10G.nvme,if=none,id=mynvme
>>> -device
>>> nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=1,
>> As mentioned above, try several with several LUNs.
>> 
>>> lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,lbbtable=/storage2/vmimages/nvmebackingstore10G.bbtable,lmetadata=/storage2/vmimages/nvmebackingstore10G.meta,ldebug=1
>>> 
>>> The backing store file was created with
>>> truncate -s 10G /storage2/vmimages/nvmebackingstore10G.nvme
>>> 
>>> This might either be a bug in the OpenChannelSSD qemu tree, or it might
>>> be a kernel bug.
>>> 
>>> I also got warnings like the below:
>> In the 4.12 patches for pblk we do not have an error state machine. This
>> is, when writes fail on the device (on qemu in this case), we did not
>> communicate this to the application. This bad error handling results in
>> unexpected side-errors like the one you are experiencing. On the patches
>> for 4.13, we have implemented the error state machine, so this type of
>> errors should be better handled.
> 
> Oh. Shouldn't a minimal version of those patches get merged into 4.12
> (or 4.12-stable once 4.12 is released) to avoid releasing a kernel with
> a data corruption bug?

This is only in case the device fails, how we handle the error on the
host. If the device is not accepting writes for some reason, data is
lost anyway. So I don't think we need the fix for stable.

> 
> 
>> You can pick up the code from out github (linux.git - branch:
>> pblk.for-4.13) or take it directly form Jens' for-4.13/core
> 
> Thanks. A full kernel compile will take some time, though. Do you happen
> to have a Ubuntu-compatible kernel .deb for the new code?

We thought about, but never actually did it (to share at least). I see
it might be useful :) For the time being, I'll share a minimal .config
for qemu, which takes a couple of minutes to compile.

> 
> 
>>> [   40.842852] pblk: double invalidate
>>> [   40.842872] ------------[ cut here ]------------
>>> [   40.842879] WARNING: CPU: 0 PID: 1065 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:87 __pblk_map_invalidate+0x107/0x150 [pblk]
>>> [   40.842880] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
>>> aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
>>> e_core e1000
>>> [   40.842906] CPU: 0 PID: 1065 Comm: nvme Not tainted 4.12.0-999-generic #201706212201
>>> [   40.842906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>> [   40.842907] task: ffff923077ed6a40 task.stack: ffffa3c2c0a64000
>>> [   40.842910] RIP: 0010:__pblk_map_invalidate+0x107/0x150 [pblk]
>>> [   40.842911] RSP: 0018:ffffa3c2c0a67ab0 EFLAGS: 00010282
>>> [   40.842913] RAX: 0000000000000017 RBX: ffff9230777400a4 RCX: 0000000000000000
>>> [   40.842914] RDX: 0000000000000000 RSI: ffff92307fc0dcc8 RDI: ffff92307fc0dcc8
>>> [   40.842914] RBP: ffffa3c2c0a67ad0 R08: 0000000000000001 R09: 000000000000020b
>>> [   40.842915] R10: 0000000000000000 R11: 000000000000020b R12: ffff923077740000
>>> [   40.842916] R13: ffff923077eb3000 R14: 0000000000000008 R15: ffff923077eb3000
>>> [   40.842918] FS:  00007f95e8dfa700(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
>>> [   40.842919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   40.842919] CR2: 000056365f4d6ce8 CR3: 0000000171e93000 CR4: 00000000000006f0
>>> [   40.842923] Call Trace:
>>> [   40.842928]  pblk_map_invalidate+0x91/0xa0 [pblk]
>>> [   40.842931]  pblk_update_map+0x62/0x270 [pblk]
>>> [   40.842934]  pblk_recov_l2p+0x338/0x960 [pblk]
>>> [   40.842937]  pblk_init+0x9ca/0xc70 [pblk]
>>> [   40.842940]  ? pblk_init+0x9ca/0xc70 [pblk]
>>> [   40.842943]  ? blkcg_init_queue+0x95/0xe0
>>> [   40.842946]  nvm_ioctl_dev_create.isra.18+0x75e/0xa20
>>> [   40.842948]  ? nvm_ioctl_dev_create.isra.18+0x75e/0xa20
>>> [   40.842951]  nvm_ctl_ioctl+0x23c/0x4c0
>>> [   40.842954]  ? __check_object_size+0xb3/0x190
>>> [   40.842956]  do_vfs_ioctl+0xa3/0x600
>>> [   40.842958]  ? putname+0x54/0x60
>>> [   40.842960]  SyS_ioctl+0x79/0x90
>>> [   40.842962]  entry_SYSCALL_64_fastpath+0x1e/0xa9
>>> [   40.842964] RIP: 0033:0x7f95e8911f07
>>> [   40.842964] RSP: 002b:00007fff007cb048 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [   40.842966] RAX: ffffffffffffffda RBX: 00007fff007cd8ca RCX: 00007f95e8911f07
>>> [   40.842967] RDX: 00007fff007cb060 RSI: 0000000040804c22 RDI: 0000000000000003
>>> [   40.842967] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>>> [   40.842968] R10: 00007fff007cd900 R11: 0000000000000246 R12: 0000000000000000
>>> [   40.842969] R13: 0000000000000000 R14: 0000000000000004 R15: 00007fff007cd8f9
>>> [   40.842970] Code: 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3 80 3d 6e e3 00 00 00 75 15 48 c7 c7 4c cb 75 c0 c6 05 5e e3 00 00 01 e8 f0 f6 a5 ca <0f> ff 48 89 df c6 07 00 0f 1f 40 00 5b 41 5c 41 5d 41 5e 5d c3
>>> [   40.842994] ---[ end trace 3232b0c9b6044a9f ]---
>>> [   41.018252] pblk: failed to recover L2P
>>> [   41.057078] pblk: failed to recover L2P
>>> [   41.093799] pblk: failed to recover L2P
>>> [   41.130127] pblk: failed to recover L2P
>>> [   41.166476] pblk: failed to recover L2P
>>> [   41.202939] pblk: failed to recover L2P
>>> [   41.239758] pblk: failed to recover L2P
>>> [   41.275851] pblk: failed to recover L2P
>>> [   41.281475] pblk init: luns:1, lines:638, secs:2613248, buf entries:128
>>> [   41.282849]  mylightnvmdevice: mylightnvmdevice1
>>> [  303.375062] EXT4-fs (mylightnvmdevice1): mounted filesystem with ordered data mode. Opts: (null)
>>> [  307.023723] ------------[ cut here ]------------
>>> [  307.023743] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/drivers/lightnvm/pblk-core.c:1323 pblk_line_put+0xb7/0xc0 [pblk]
>>> [  307.023745] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 r
>>> aid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvm
>>> e_core e1000
>>> [  307.023861] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
>>> [  307.023862] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>> [  307.023863] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
>>> [  307.023868] RIP: 0010:pblk_line_put+0xb7/0xc0 [pblk]
>>> [  307.023869] RSP: 0018:ffffa3c2c0c83ac8 EFLAGS: 00010293
>>> [  307.023871] RAX: 0000000000000000 RBX: ffff923077eb3000 RCX: 0000000000000000
>>> [  307.023873] RDX: 0000000000000001 RSI: ffff9230777405e0 RDI: ffff9230777405e4
>>> [  307.023874] RBP: ffffa3c2c0c83ae8 R08: 8000000000000009 R09: ffff923078c44040
>>> [  307.023875] R10: ffffa3c2c0a5fd30 R11: 00000000000003af R12: ffff9230777405e0
>>> [  307.023876] R13: ffff9230777405e4 R14: ffff923077740540 R15: ffff9230777405e0
>>> [  307.023878] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
>>> [  307.023880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  307.023881] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
>>> [  307.023886] Call Trace:
>>> [  307.023893]  __pblk_rb_update_l2p+0xcb/0xd0 [pblk]
>>> [  307.023898]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
>>> [  307.023902]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
>>> [  307.023906]  pblk_make_rq+0x71/0x120 [pblk]
>>> [  307.023909]  ? pblk_make_rq+0x71/0x120 [pblk]
>>> [  307.023914]  generic_make_request+0x11e/0x2d0
>>> [  307.023916]  submit_bio+0x73/0x150
>>> [  307.023918]  ? submit_bio+0x73/0x150
>>> [  307.023921]  next_bio+0x38/0x40
>>> [  307.023923]  __blkdev_issue_zeroout+0x159/0x200
>>> [  307.023926]  blkdev_issue_zeroout+0x6c/0xc0
>>> [  307.023931]  ext4_init_inode_table+0x18a/0x380
>>> [  307.023934]  ? ext4_init_inode_table+0x18a/0x380
>>> [  307.023938]  ext4_lazyinit_thread+0x296/0x3b0
>>> [  307.023942]  kthread+0x109/0x140
>>> [  307.023945]  ? init_once+0x80/0x80
>>> [  307.023948]  ? kthread_create_on_node+0x70/0x70
>>> [  307.023951]  ret_from_fork+0x25/0x30
>>> [  307.023952] Code: 49 89 44 24 88 48 89 10 83 43 64 01 c6 07 00 0f 1f 40 00 48 8d bb 68 02 00 00 4c 89 f6 e8 a2 7c 00 00 5b 41 5c 41 5d 41 5e 5d c3 <0f> ff e9 7a ff ff ff 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41
>>> [  307.023986] ---[ end trace 3232b0c9b6044aa0 ]---
>>> [  307.023989] refcount_t: underflow; use-after-free.
>>> [  307.024001] ------------[ cut here ]------------
>>> [  307.024006] WARNING: CPU: 0 PID: 1162 at /home/kernel/COD/linux/lib/refcount.c:184 refcount_sub_and_test+0x45/0x50
>>> [  307.024006] Modules linked in: pblk ppdev joydev input_leds mac_hid serio_raw lpc_ich parport_pc pvpanic parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm floppy ahci libahci nvme nvme_core e1000
>>> [  307.024033] CPU: 0 PID: 1162 Comm: ext4lazyinit Tainted: G        W       4.12.0-999-generic #201706212201
>>> [  307.024034] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>> [  307.024035] task: ffff923072ff8000 task.stack: ffffa3c2c0c80000
>>> [  307.024037] RIP: 0010:refcount_sub_and_test+0x45/0x50
>>> [  307.024038] RSP: 0018:ffffa3c2c0c83ad8 EFLAGS: 00010286
>>> [  307.024040] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000006
>>> [  307.024041] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff92307fc0dcc0
>>> [  307.024043] RBP: ffffa3c2c0c83ad8 R08: 0000000000000001 R09: 0000000000000264
>>> [  307.024044] R10: ffffa3c2c0c83a50 R11: 0000000000000264 R12: ffff923077eb3220
>>> [  307.024045] R13: ffffa3c2c08f1048 R14: ffff923077eb3208 R15: ffff9230777405e0
>>> [  307.024047] FS:  0000000000000000(0000) GS:ffff92307fc00000(0000) knlGS:0000000000000000
>>> [  307.024048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  307.024049] CR2: 0000000001701808 CR3: 00000001788ee000 CR4: 00000000000006f0
>>> [  307.024052] Call Trace:
>>> [  307.024055]  refcount_dec_and_test+0x11/0x20
>>> [  307.024059]  __pblk_rb_update_l2p+0x7a/0xd0 [pblk]
>>> [  307.024063]  pblk_rb_may_write_user+0x15f/0x180 [pblk]
>>> [  307.024067]  pblk_write_to_cache+0x6f/0x1b0 [pblk]
>>> [  307.024071]  pblk_make_rq+0x71/0x120 [pblk]
>>> [  307.024074]  ? pblk_make_rq+0x71/0x120 [pblk]
>>> [  307.024076]  generic_make_request+0x11e/0x2d0
>>> [  307.024079]  submit_bio+0x73/0x150
>>> [  307.024081]  ? submit_bio+0x73/0x150
>>> [  307.024083]  next_bio+0x38/0x40
>>> [  307.024085]  __blkdev_issue_zeroout+0x159/0x200
>>> [  307.024087]  blkdev_issue_zeroout+0x6c/0xc0
>>> [  307.024091]  ext4_init_inode_table+0x18a/0x380
>>> [  307.024094]  ? ext4_init_inode_table+0x18a/0x380
>>> [  307.024097]  ext4_lazyinit_thread+0x296/0x3b0
>>> [  307.024100]  kthread+0x109/0x140
>>> [  307.024103]  ? init_once+0x80/0x80
>>> [  307.024105]  ? kthread_create_on_node+0x70/0x70
>>> [  307.024107]  ret_from_fork+0x25/0x30
>>> [  307.024109] Code: 75 e6 85 d2 0f 94 c0 c3 31 c0 c3 80 3d 19 55 b2 00 00 75 f4 55 48 c7 c7 88 cd ce 8b c6 05 08 55 b2 00 01 48 89 e5 e8 22 4f d5 ff <0f> ff 31 c0 5d c3 0f 1f 44 00 00 55 48 89 fe bf 01 00 00 00 48
>>> [  307.024142] ---[ end trace 3232b0c9b6044aa1 ]---
>>> [  484.430925] INFO: task jbd2/mylightnvm:1160 blocked for more than 120 seconds.
>>> [  484.432106]       Tainted: G        W       4.12.0-999-generic #201706212201
>>> [  484.432340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  484.432567] jbd2/mylightnvm D    0  1160      2 0x00000000
>>> [  484.432570] Call Trace:
>>> [  484.432606]  __schedule+0x3c9/0x850
>>> [  484.432616]  ? __enqueue_entity+0x6c/0x70
>>> [  484.432619]  ? wake_atomic_t_function+0x60/0x60
>>> [  484.432621]  schedule+0x36/0x80
>>> [  484.432628]  jbd2_journal_commit_transaction+0x25e/0x1710
>>> [  484.432630]  ? update_load_avg+0x429/0x5a0
>>> 
>>> 
>>> AFAICS pblk should not cause such problems.
>>> 
>>> Reproducing with qemu is easy. One or two reboots of the VM might be needed.
>>> 
>>> Side note: I have to recreate the pblk device after every reboot because
>>> it is not recognized automatically.
>>> nvme lvnm create -d nvme0n1 --lun-begin=0 --lun-end=0 -n
>>> mylightnvmdevice -t pblk
>> You need to re-create the instance each time. Think of it as mounting a
>> file system.
> 
> Oh. I had assumed that this operation was more akin to creating a LVM
> logical volume.
> If I re-create the instance with differing values (e.g. different LUNs),
> will the data still be there or will creation be refused to avoid data loss?
> Filed as https://github.com/OpenChannelSSD/linux/issues/31
> 
> Regards,
> Carl-Daniel

Javier

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: LightNVM pblk: read/write of random kernel memory
  2017-06-28 14:58     ` Javier Gonzalez
@ 2017-06-28 15:14       ` Carl-Daniel Hailfinger
       [not found]       ` <4F70E259-B8AB-4C99-9ABC-4D7B6F7169DF@cnexlabs.com>
  2017-06-30 14:05       ` Carl-Daniel Hailfinger
  2 siblings, 0 replies; 7+ messages in thread
From: Carl-Daniel Hailfinger @ 2017-06-28 15:14 UTC (permalink / raw)
  To: Javier Gonzalez; +Cc: Matias Bjørling, linux-block

Hi Javier,

On 28.06.2017 16:58, Javier Gonzalez wrote:
>> On 28 Jun 2017, at 16.33, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>
>> thanks for the pointer to the github reporting page.
>> I'll answer your questions here (to make then indexable by search
>> engines in case someone else stumbles upon this) and link to newly
>> created github issues for the various problems I encountered.
>>
> Ok. I answered each issue directly on the github. A couple og things
> inline though, for completion.

Thanks.

 
>> On 28.06.2017 13:07, Javier Gonzalez wrote:
>>> I'll take the question here, but please use our github [1] to report
>>> errors and ask questions instead (including this thread). No need to
>>> spam the rest of the linux-block mailing list for LightNVM specific
>>> matters - unless of course, you want to discuss specific parts of the
>>> code.
>>>
>>> [1] https://github.com/OpenChannelSSD
>>>
>>>> On 28 Jun 2017, at 01.30, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>>>
>>>> I'm currently having trouble with LightNVM pblk with kernel 4.12-rc7 on
>>>> Ubuntu 16.04.2 x86_64 in a Qemu VM using latest
>>>> https://github.com/OpenChannelSSD/qemu-nvme .
>>>>
>>>> I'm creating a pblk device inside the VM with the following command:
>>>> [...]
>>>>
>>>> This might either be a bug in the OpenChannelSSD qemu tree, or it might
>>>> be a kernel bug.
>>>>
>>>> I also got warnings like the below:
>>> In the 4.12 patches for pblk we do not have an error state machine. This
>>> is, when writes fail on the device (on qemu in this case), we did not
>>> communicate this to the application. This bad error handling results in
>>> unexpected side-errors like the one you are experiencing. On the patches
>>> for 4.13, we have implemented the error state machine, so this type of
>>> errors should be better handled.
>> Oh. Shouldn't a minimal version of those patches get merged into 4.12
>> (or 4.12-stable once 4.12 is released) to avoid releasing a kernel with
>> a data corruption bug?
> This is only in case the device fails, how we handle the error on the
> host. If the device is not accepting writes for some reason, data is
> lost anyway. So I don't think we need the fix for stable.

This is odd. AFAICS qemu isn't configured to simulate device failure, so
in theory this should never have happened. Can you think of any reason
why this code path was triggered? Should I open a separate github issue
for that?


>>> You can pick up the code from out github (linux.git - branch:
>>> pblk.for-4.13) or take it directly form Jens' for-4.13/core
>> Thanks. A full kernel compile will take some time, though. Do you happen
>> to have a Ubuntu-compatible kernel .deb for the new code?
> We thought about, but never actually did it (to share at least). I see
> it might be useful :) For the time being, I'll share a minimal .config
> for qemu, which takes a couple of minutes to compile.

Thanks!

 
>> [various bugs]
>> Filed as https://github.com/OpenChannelSSD/linux/issues/28
>> Filed as https://github.com/OpenChannelSSD/linux/issues/29
>> Filed as https://github.com/OpenChannelSSD/linux/issues/30
>> Filed as https://github.com/OpenChannelSSD/linux/issues/31
>>
>> Regards,
>> Carl-Daniel
> Javier

Regards,
Carl-Daniel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: LightNVM pblk: read/write of random kernel memory
       [not found]       ` <4F70E259-B8AB-4C99-9ABC-4D7B6F7169DF@cnexlabs.com>
@ 2017-06-28 15:19         ` Javier Gonzalez
  0 siblings, 0 replies; 7+ messages in thread
From: Javier Gonzalez @ 2017-06-28 15:19 UTC (permalink / raw)
  To: Javier Gonzalez; +Cc: Matias Bjørling, linux-block

[-- Attachment #1: Type: text/plain, Size: 3950 bytes --]

> On 28 Jun 2017, at 17.15, Javier Gonzalez <javier@cnexlabs.com> wrote:
> 
> On 28.06.2017 16:58, Javier Gonzalez wrote:
>>> On 28 Jun 2017, at 16.33, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>> 
>>> thanks for the pointer to the github reporting page.
>>> I'll answer your questions here (to make then indexable by search
>>> engines in case someone else stumbles upon this) and link to newly
>>> created github issues for the various problems I encountered.
>> Ok. I answered each issue directly on the github. A couple og things
>> inline though, for completion.
> 
> Thanks.
> 
> 
>>> On 28.06.2017 13:07, Javier Gonzalez wrote:
>>>> I'll take the question here, but please use our github [1] to report
>>>> errors and ask questions instead (including this thread). No need to
>>>> spam the rest of the linux-block mailing list for LightNVM specific
>>>> matters - unless of course, you want to discuss specific parts of the
>>>> code.
>>>> 
>>>> [1] https://github.com/OpenChannelSSD
>>>> 
>>>>> On 28 Jun 2017, at 01.30, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>>>> 
>>>>> I'm currently having trouble with LightNVM pblk with kernel 4.12-rc7 on
>>>>> Ubuntu 16.04.2 x86_64 in a Qemu VM using latest
>>>>> https://github.com/OpenChannelSSD/qemu-nvme .
>>>>> 
>>>>> I'm creating a pblk device inside the VM with the following command:
>>>>> [...]
>>>>> 
>>>>> This might either be a bug in the OpenChannelSSD qemu tree, or it might
>>>>> be a kernel bug.
>>>>> 
>>>>> I also got warnings like the below:
>>>> In the 4.12 patches for pblk we do not have an error state machine. This
>>>> is, when writes fail on the device (on qemu in this case), we did not
>>>> communicate this to the application. This bad error handling results in
>>>> unexpected side-errors like the one you are experiencing. On the patches
>>>> for 4.13, we have implemented the error state machine, so this type of
>>>> errors should be better handled.
>>> Oh. Shouldn't a minimal version of those patches get merged into 4.12
>>> (or 4.12-stable once 4.12 is released) to avoid releasing a kernel with
>>> a data corruption bug?
>> This is only in case the device fails, how we handle the error on the
>> host. If the device is not accepting writes for some reason, data is
>> lost anyway. So I don't think we need the fix for stable.
> 
> This is odd. AFAICS qemu isn't configured to simulate device failure, so
> in theory this should never have happened. Can you think of any reason
> why this code path was triggered? Should I open a separate github issue
> for that?

Qemu does not simulate failures (though we have implemented error
injection for LightNVM), but if there is an error on the device
simulation then it will propagate to the host. Based on what I can see,
seems like your qemu instance is doing something funky and that most of
your errors are related, but we need to look deeper into it.

In any case, let's continue the discussion on the open github issues.

> 
>>>> You can pick up the code from out github (linux.git - branch:
>>>> pblk.for-4.13) or take it directly form Jens' for-4.13/core
>>> Thanks. A full kernel compile will take some time, though. Do you happen
>>> to have a Ubuntu-compatible kernel .deb for the new code?
>> We thought about, but never actually did it (to share at least). I see
>> it might be useful :) For the time being, I'll share a minimal .config
>> for qemu, which takes a couple of minutes to compile.
> 
> Thanks!
> 
> 
>>> [various bugs]
>>> Filed as https://github.com/OpenChannelSSD/linux/issues/28
>>> Filed as https://github.com/OpenChannelSSD/linux/issues/29
>>> Filed as https://github.com/OpenChannelSSD/linux/issues/30
>>> Filed as https://github.com/OpenChannelSSD/linux/issues/31
>>> 
>>> Regards,
>>> Carl-Daniel
>> Javier
> 
> Regards,
> Carl-Daniel

Javier

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: LightNVM pblk: read/write of random kernel memory
  2017-06-28 14:58     ` Javier Gonzalez
  2017-06-28 15:14       ` Carl-Daniel Hailfinger
       [not found]       ` <4F70E259-B8AB-4C99-9ABC-4D7B6F7169DF@cnexlabs.com>
@ 2017-06-30 14:05       ` Carl-Daniel Hailfinger
  2 siblings, 0 replies; 7+ messages in thread
From: Carl-Daniel Hailfinger @ 2017-06-30 14:05 UTC (permalink / raw)
  To: Javier Gonzalez; +Cc: Matias Bjørling, linux-block

On 28.06.2017 16:58, Javier Gonzalez wrote:
>> On 28 Jun 2017, at 16.33, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>
>> thanks for the pointer to the github reporting page.
>> I'll answer your questions here (to make then indexable by search
>> engines in case someone else stumbles upon this) and link to newly
>> created github issues for the various problems I encountered.
>>
> Ok. I answered each issue directly on the github. A couple og things
> inline though, for completion.
>
>> On 28.06.2017 13:07, Javier Gonzalez wrote:
>>> https://github.com/OpenChannelSSD
>>>
>>>> On 28 Jun 2017, at 01.30, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
>>>>
>>>> I'm currently having trouble with LightNVM pblk with kernel 4.12-rc7 on
>>>> Ubuntu 16.04.2 x86_64 in a Qemu VM using latest
>>>> https://github.com/OpenChannelSSD/qemu-nvme .
>>>>
>>>> Writing to the pblk device is only partially successful. I can see some
>>>> of the content which was written to the pblk device turn up in the
>>>> backing store file nvmebackingstore10G.nvme, but mostly the backing
>>>> store file contains random kernel memory from the VM. Reading back the
>>>> just written contents from the pblk device in the VM also yields random
>>>> kernel memory (or at least that's what I think that stuff is, i.e. lots
>>>> of strings present in various printk calls).
>>> Can you better define partially succesful?
>> Some of the contents written to the pblk device inside the vm end up
>> being written to the backing store, and some regions of the backing
>> store contain random kernel memory of the vm after a write. I am unable
>> to detect a pattern there, but random kernel memory should never be
>> written to disk in any case.
>>
>>
>>> Which workload are you
>>> running on top of the block device exposed by the pblk instance? Is it
>>> failing in any way?
>> I run fdisk on the instance to create a single partition with maximum
>> size, then
>> mkfs.ext4 /dev/mylightnvmdevice1
>> mount /dev/mylightnvmdevice1 /mnt
>> yes yes|head -n 4096 >/mnt/yes
>> umount /mnt
>>
>> Sometimes this results in an immediate hang during writing /mnt/yes,
>> sometimes it hangs on umount.
>> Filed as https://github.com/OpenChannelSSD/linux/issues/28
>>
>>
>> Inspecting the backing store sometimes yields the expected amount of
>> data written, sometimes parts of the backing store contain random vm
>> kernel memory. This random kernel memory can also be read from inside
>> the vm by hexdumping /dev/mylightnvmdevice .
>> Filed as https://github.com/OpenChannelSSD/linux/issues/30
>>
>>
>>>> qemu command line follows:
>>>> qemu-nvme.git/x86_64-softmmu/qemu-system-x86_64 -m 4096 -machine
>>>> q35,accel=kvm -vga qxl -spice port=5901,addr=127.0.0.1,disable-ticketing
>>>> -net nic,model=e1000 -net user -hda
>>>> /storage2/vmimages/usefulimages/ubuntu-16.04.2-server-kernel412rc6.qcow2
>>>> -drive
>>>> file=/storage2/vmimages/nvmebackingstore10G.nvme,if=none,id=mynvme
>>>> -device
>>>> nvme,drive=mynvme,serial=deadbeef,namespaces=1,lver=1,lmetasize=16,ll2pmode=0,nlbaf=5,lba_index=3,mdts=10,lnum_lun=1,
>>> As mentioned above, try several with several LUNs.
>>>
>>>> lnum_pln=2,lsec_size=4096,lsecs_per_pg=4,lpgs_per_blk=512,lbbtable=/storage2/vmimages/nvmebackingstore10G.bbtable,lmetadata=/storage2/vmimages/nvmebackingstore10G.meta,ldebug=1
>>>>
>>>> The backing store file was created with
>>>> truncate -s 10G /storage2/vmimages/nvmebackingstore10G.nvme
>>>>
>>>> This might either be a bug in the OpenChannelSSD qemu tree, or it might
>>>> be a kernel bug.
>>>>
>>>> I also got warnings like the below:
>>> In the 4.12 patches for pblk we do not have an error state machine. This
>>> is, when writes fail on the device (on qemu in this case), we did not
>>> communicate this to the application. This bad error handling results in
>>> unexpected side-errors like the one you are experiencing. On the patches
>>> for 4.13, we have implemented the error state machine, so this type of
>>> errors should be better handled.
>> Oh. Shouldn't a minimal version of those patches get merged into 4.12
>> (or 4.12-stable once 4.12 is released) to avoid releasing a kernel with
>> a data corruption bug?
> This is only in case the device fails, how we handle the error on the
> host. If the device is not accepting writes for some reason, data is
> lost anyway. So I don't think we need the fix for stable.
>
>
>>> You can pick up the code from out github (linux.git - branch:
>>> pblk.for-4.13) or take it directly form Jens' for-4.13/core

I can reproduce the hang in a few seconds just by writing 4096 MB to a
standard pblk device.
dd if=/dev/zero bs=1M count=4096 of=/dev/mypblkdevice
See also https://github.com/OpenChannelSSD/linux/issues/32

I can reproduce even with OpenChannelSSD linux.git branch pblk.for-4.13_v2 .

Any idea what to do next?

If it's really a qemu problem, does anyone have a working qemu command
line in combination with a way to create a backing store file which
works, and can you share that?

Regards,
Carl-Daniel

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-06-30 14:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-27 23:30 LightNVM pblk: read/write of random kernel memory Carl-Daniel Hailfinger
2017-06-28 11:07 ` Javier Gonzalez
2017-06-28 14:33   ` Carl-Daniel Hailfinger
2017-06-28 14:58     ` Javier Gonzalez
2017-06-28 15:14       ` Carl-Daniel Hailfinger
     [not found]       ` <4F70E259-B8AB-4C99-9ABC-4D7B6F7169DF@cnexlabs.com>
2017-06-28 15:19         ` Javier Gonzalez
2017-06-30 14:05       ` Carl-Daniel Hailfinger

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.