All of lore.kernel.org
 help / color / mirror / Atom feed
* 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-11 17:27 ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-11 17:27 UTC (permalink / raw)
  To: Linux Kernel; +Cc: xfs

BUG: sleeping function called from invalid context at mm/mempool.c:203
in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
5 locks held by trinity-c3/27511:
 #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
 #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
 #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
 #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
 #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
 ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
 0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
 ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
Call Trace:
 [<ffffffff8372afea>] dump_stack+0x4e/0x7a
 [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
 [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
 [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
 [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
 [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
 [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
 [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
 [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
 [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
 [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
 [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
 [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
 [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
 [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
 [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
 [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
 [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
 [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
 [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
 [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
 [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
 [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
 [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
 [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
 [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
 [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
 [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
 [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
 [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
 [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
 [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
 [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
 [<ffffffff831ccc15>] vfs_create+0x95/0xc0
 [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
 [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
 [<ffffffff831cdef9>] path_openat+0xc9/0x620
 [<ffffffff8331cc02>] ? put_dec+0x72/0x90
 [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
 [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
 [<ffffffff831bc554>] filp_open+0x44/0x60
 [<ffffffff83221862>] do_coredump+0x602/0xf60
 [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
 [<ffffffff830024d7>] do_signal+0x57/0x9d0
 [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
 [<ffffffff83730000>] ? __schedule+0x60/0x850
 [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
 [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
 [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
 [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
 [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
 [<ffffffff83736746>] retint_signal+0x46/0x90
------------[ cut here ]------------
WARNING: CPU: 3 PID: 27511 at block/blk.h:227 generic_make_request_checks+0x33f/0x460()
Modules linked in: bridge stp tun snd_seq_dummy fuse hidp bnep rfcomm nf_conntrack_netlink nf_conntrack scsi_transport_iscsi nfnetlink can_raw nfc pn_pep caif_socket caif af_802154 phonet af_rxrpc can_bcm can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds ipt_ULOG af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 xfs libcrc32c coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic serio_raw microcode pcspkr snd_hda_intel snd_hda_codec usb_debug btusb snd_hwdep snd_seq snd_seq_device bluetooth 6lowpan_iphc rfkill snd_pcm snd_timer snd soundcore e1000e ptp shpchp pps_core
CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
 0000000000000009 0000000045a3849a ffff88009f368f10 ffffffff8372afea
 0000000000000000 ffff88009f368f48 ffffffff8306d0cd ffff88023933ed00
 00000000ffffffff 00000000197f7510 0000000000800005 ffff880240913740
Call Trace:
 [<ffffffff8372afea>] dump_stack+0x4e/0x7a
 [<ffffffff8306d0cd>] warn_slowpath_common+0x7d/0xa0
 [<ffffffff8306d1fa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff832e81bf>] generic_make_request_checks+0x33f/0x460
 [<ffffffff832e8307>] generic_make_request+0x27/0x130
 [<ffffffff832e8488>] submit_bio+0x78/0x160
 [<ffffffff831f9000>] ? bio_alloc_bioset+0x180/0x210
 [<ffffffffc030832b>] _xfs_buf_ioapply+0x2bb/0x3c0 [xfs]
 [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
 [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
 [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
 [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
 [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
 [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
 [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
 [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
 [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
 [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
 [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
 [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
 [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
 [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
 [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
 [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
 [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
 [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
 [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
 [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
 [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
 [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
 [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
 [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
 [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
 [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
 [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
 [<ffffffff831ccc15>] vfs_create+0x95/0xc0
 [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
 [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
 [<ffffffff831cdef9>] path_openat+0xc9/0x620
 [<ffffffff8331cc02>] ? put_dec+0x72/0x90
 [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
 [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
 [<ffffffff831bc554>] filp_open+0x44/0x60
 [<ffffffff83221862>] do_coredump+0x602/0xf60
 [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
 [<ffffffff830024d7>] do_signal+0x57/0x9d0
 [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
 [<ffffffff83730000>] ? __schedule+0x60/0x850
 [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
 [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
 [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
 [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
 [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
 [<ffffffff83736746>] retint_signal+0x46/0x90
---[ end trace e6e48f6461c98e81 ]---


^ permalink raw reply	[flat|nested] 132+ messages in thread

* 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-11 17:27 ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-11 17:27 UTC (permalink / raw)
  To: Linux Kernel; +Cc: xfs

BUG: sleeping function called from invalid context at mm/mempool.c:203
in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
5 locks held by trinity-c3/27511:
 #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
 #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
 #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
 #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
 #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
 ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
 0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
 ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
Call Trace:
 [<ffffffff8372afea>] dump_stack+0x4e/0x7a
 [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
 [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
 [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
 [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
 [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
 [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
 [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
 [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
 [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
 [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
 [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
 [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
 [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
 [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
 [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
 [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
 [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
 [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
 [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
 [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
 [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
 [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
 [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
 [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
 [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
 [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
 [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
 [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
 [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
 [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
 [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
 [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
 [<ffffffff831ccc15>] vfs_create+0x95/0xc0
 [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
 [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
 [<ffffffff831cdef9>] path_openat+0xc9/0x620
 [<ffffffff8331cc02>] ? put_dec+0x72/0x90
 [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
 [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
 [<ffffffff831bc554>] filp_open+0x44/0x60
 [<ffffffff83221862>] do_coredump+0x602/0xf60
 [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
 [<ffffffff830024d7>] do_signal+0x57/0x9d0
 [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
 [<ffffffff83730000>] ? __schedule+0x60/0x850
 [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
 [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
 [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
 [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
 [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
 [<ffffffff83736746>] retint_signal+0x46/0x90
------------[ cut here ]------------
WARNING: CPU: 3 PID: 27511 at block/blk.h:227 generic_make_request_checks+0x33f/0x460()
Modules linked in: bridge stp tun snd_seq_dummy fuse hidp bnep rfcomm nf_conntrack_netlink nf_conntrack scsi_transport_iscsi nfnetlink can_raw nfc pn_pep caif_socket caif af_802154 phonet af_rxrpc can_bcm can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds ipt_ULOG af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 xfs libcrc32c coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic serio_raw microcode pcspkr snd_hda_intel snd_hda_codec usb_debug btusb snd_hwdep snd_seq snd_seq_device bluetooth 6lowpan_iphc rfkill snd_pcm snd_timer snd soundcore e1000e ptp shpchp pps_core
CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
 0000000000000009 0000000045a3849a ffff88009f368f10 ffffffff8372afea
 0000000000000000 ffff88009f368f48 ffffffff8306d0cd ffff88023933ed00
 00000000ffffffff 00000000197f7510 0000000000800005 ffff880240913740
Call Trace:
 [<ffffffff8372afea>] dump_stack+0x4e/0x7a
 [<ffffffff8306d0cd>] warn_slowpath_common+0x7d/0xa0
 [<ffffffff8306d1fa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff832e81bf>] generic_make_request_checks+0x33f/0x460
 [<ffffffff832e8307>] generic_make_request+0x27/0x130
 [<ffffffff832e8488>] submit_bio+0x78/0x160
 [<ffffffff831f9000>] ? bio_alloc_bioset+0x180/0x210
 [<ffffffffc030832b>] _xfs_buf_ioapply+0x2bb/0x3c0 [xfs]
 [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
 [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
 [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
 [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
 [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
 [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
 [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
 [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
 [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
 [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
 [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
 [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
 [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
 [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
 [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
 [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
 [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
 [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
 [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
 [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
 [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
 [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
 [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
 [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
 [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
 [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
 [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
 [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
 [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
 [<ffffffff831ccc15>] vfs_create+0x95/0xc0
 [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
 [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
 [<ffffffff831cdef9>] path_openat+0xc9/0x620
 [<ffffffff8331cc02>] ? put_dec+0x72/0x90
 [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
 [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
 [<ffffffff831bc554>] filp_open+0x44/0x60
 [<ffffffff83221862>] do_coredump+0x602/0xf60
 [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
 [<ffffffff830024d7>] do_signal+0x57/0x9d0
 [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
 [<ffffffff83730000>] ? __schedule+0x60/0x850
 [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
 [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
 [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
 [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
 [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
 [<ffffffff83736746>] retint_signal+0x46/0x90
---[ end trace e6e48f6461c98e81 ]---

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-11 17:27 ` Dave Jones
@ 2014-02-11 21:08   ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-11 21:08 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
> BUG: sleeping function called from invalid context at mm/mempool.c:203
> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
> 5 locks held by trinity-c3/27511:
>  #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
>  #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
>  #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
>  #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
>  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
>  ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
>  0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
>  ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
> Call Trace:
>  [<ffffffff8372afea>] dump_stack+0x4e/0x7a
>  [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
>  [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
>  [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
>  [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
>  [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
>  [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
>  [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
>  [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
>  [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
>  [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
>  [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
>  [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
>  [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
>  [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>  [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
>  [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
>  [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
>  [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
>  [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
>  [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
>  [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
>  [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
>  [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
>  [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
>  [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
>  [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
>  [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
>  [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
>  [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
>  [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
>  [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
>  [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
>  [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
>  [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
>  [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
>  [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
>  [<ffffffff831ccc15>] vfs_create+0x95/0xc0
>  [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
>  [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
>  [<ffffffff831cdef9>] path_openat+0xc9/0x620
>  [<ffffffff8331cc02>] ? put_dec+0x72/0x90
>  [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
>  [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
>  [<ffffffff831bc554>] filp_open+0x44/0x60
>  [<ffffffff83221862>] do_coredump+0x602/0xf60
>  [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
>  [<ffffffff830024d7>] do_signal+0x57/0x9d0
>  [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
>  [<ffffffff83730000>] ? __schedule+0x60/0x850
>  [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
>  [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
>  [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
>  [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
>  [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
>  [<ffffffff83736746>] retint_signal+0x46/0x90

There's nowhere in this XFS stack that disables interrupts, so I
thinks it's either above or below XFS that is causing this problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-11 21:08   ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-11 21:08 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
> BUG: sleeping function called from invalid context at mm/mempool.c:203
> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
> 5 locks held by trinity-c3/27511:
>  #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
>  #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
>  #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
>  #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
>  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
>  ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
>  0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
>  ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
> Call Trace:
>  [<ffffffff8372afea>] dump_stack+0x4e/0x7a
>  [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
>  [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
>  [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
>  [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
>  [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
>  [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
>  [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
>  [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
>  [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
>  [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
>  [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
>  [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
>  [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
>  [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>  [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
>  [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
>  [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
>  [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
>  [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
>  [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
>  [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
>  [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
>  [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
>  [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
>  [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
>  [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
>  [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
>  [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
>  [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
>  [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
>  [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
>  [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
>  [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
>  [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
>  [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
>  [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
>  [<ffffffff831ccc15>] vfs_create+0x95/0xc0
>  [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
>  [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
>  [<ffffffff831cdef9>] path_openat+0xc9/0x620
>  [<ffffffff8331cc02>] ? put_dec+0x72/0x90
>  [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
>  [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
>  [<ffffffff831bc554>] filp_open+0x44/0x60
>  [<ffffffff83221862>] do_coredump+0x602/0xf60
>  [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
>  [<ffffffff830024d7>] do_signal+0x57/0x9d0
>  [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
>  [<ffffffff83730000>] ? __schedule+0x60/0x850
>  [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
>  [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
>  [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
>  [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
>  [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
>  [<ffffffff83736746>] retint_signal+0x46/0x90

There's nowhere in this XFS stack that disables interrupts, so I
thinks it's either above or below XFS that is causing this problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-11 21:08   ` Dave Chinner
@ 2014-02-11 21:49     ` Eric Sandeen
  -1 siblings, 0 replies; 132+ messages in thread
From: Eric Sandeen @ 2014-02-11 21:49 UTC (permalink / raw)
  To: Dave Chinner, Dave Jones, Linux Kernel, xfs

On 2/11/14, 3:08 PM, Dave Chinner wrote:
> On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
>> BUG: sleeping function called from invalid context at mm/mempool.c:203
>> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
>> 5 locks held by trinity-c3/27511:
>>  #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
>>  #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
>>  #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
>>  #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
>>  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
>> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
>>  ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
>>  0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
>>  ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
>> Call Trace:
>>  [<ffffffff8372afea>] dump_stack+0x4e/0x7a
>>  [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
>>  [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
>>  [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
>>  [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
>>  [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
>>  [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
>>  [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
>>  [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
>>  [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
>>  [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
>>  [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
>>  [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
>>  [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
>>  [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>>  [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
>>  [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
>>  [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
>>  [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
>>  [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
>>  [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
>>  [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
>>  [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
>>  [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
>>  [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
>>  [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
>>  [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
>>  [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
>>  [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
>>  [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
>>  [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
>>  [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
>>  [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
>>  [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
>>  [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
>>  [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
>>  [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
>>  [<ffffffff831ccc15>] vfs_create+0x95/0xc0
>>  [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
>>  [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
>>  [<ffffffff831cdef9>] path_openat+0xc9/0x620
>>  [<ffffffff8331cc02>] ? put_dec+0x72/0x90
>>  [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
>>  [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
>>  [<ffffffff831bc554>] filp_open+0x44/0x60
>>  [<ffffffff83221862>] do_coredump+0x602/0xf60
>>  [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
>>  [<ffffffff830024d7>] do_signal+0x57/0x9d0
>>  [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
>>  [<ffffffff83730000>] ? __schedule+0x60/0x850
>>  [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
>>  [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
>>  [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
>>  [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
>>  [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
>>  [<ffffffff83736746>] retint_signal+0x46/0x90
> 
> There's nowhere in this XFS stack that disables interrupts, so I
> thinks it's either above or below XFS that is causing this problem.

any chance the stack got blown & corrupted things such that it
only _looks_ like irqs are disabled?

-Eric

> Cheers,
> 
> Dave.
> 


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-11 21:49     ` Eric Sandeen
  0 siblings, 0 replies; 132+ messages in thread
From: Eric Sandeen @ 2014-02-11 21:49 UTC (permalink / raw)
  To: Dave Chinner, Dave Jones, Linux Kernel, xfs

On 2/11/14, 3:08 PM, Dave Chinner wrote:
> On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
>> BUG: sleeping function called from invalid context at mm/mempool.c:203
>> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
>> 5 locks held by trinity-c3/27511:
>>  #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
>>  #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
>>  #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
>>  #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
>>  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
>> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
>>  ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
>>  0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
>>  ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
>> Call Trace:
>>  [<ffffffff8372afea>] dump_stack+0x4e/0x7a
>>  [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
>>  [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
>>  [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
>>  [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
>>  [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
>>  [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
>>  [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
>>  [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
>>  [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
>>  [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
>>  [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
>>  [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
>>  [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
>>  [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>>  [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
>>  [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
>>  [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
>>  [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
>>  [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
>>  [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
>>  [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
>>  [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
>>  [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
>>  [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
>>  [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
>>  [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
>>  [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
>>  [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
>>  [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
>>  [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
>>  [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
>>  [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
>>  [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
>>  [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
>>  [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
>>  [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
>>  [<ffffffff831ccc15>] vfs_create+0x95/0xc0
>>  [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
>>  [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
>>  [<ffffffff831cdef9>] path_openat+0xc9/0x620
>>  [<ffffffff8331cc02>] ? put_dec+0x72/0x90
>>  [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
>>  [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
>>  [<ffffffff831bc554>] filp_open+0x44/0x60
>>  [<ffffffff83221862>] do_coredump+0x602/0xf60
>>  [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
>>  [<ffffffff830024d7>] do_signal+0x57/0x9d0
>>  [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
>>  [<ffffffff83730000>] ? __schedule+0x60/0x850
>>  [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
>>  [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
>>  [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
>>  [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
>>  [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
>>  [<ffffffff83736746>] retint_signal+0x46/0x90
> 
> There's nowhere in this XFS stack that disables interrupts, so I
> thinks it's either above or below XFS that is causing this problem.

any chance the stack got blown & corrupted things such that it
only _looks_ like irqs are disabled?

-Eric

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-11 21:49     ` Eric Sandeen
@ 2014-02-12  0:44       ` Dave Jones
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12  0:44 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, Linux Kernel, xfs, Al Viro, Linus Torvalds

On Tue, Feb 11, 2014 at 03:49:14PM -0600, Eric Sandeen wrote:
 > On 2/11/14, 3:08 PM, Dave Chinner wrote:
 > > On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
 > >> BUG: sleeping function called from invalid context at mm/mempool.c:203
 > >> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
 > >> 5 locks held by trinity-c3/27511:
 > >>  #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
 > >>  #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
 > >>  #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
 > >>  #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
 > >>  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
 > >> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
 > >>  ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
 > >>  0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
 > >>  ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
 > >> Call Trace:
 > >>  [<ffffffff8372afea>] dump_stack+0x4e/0x7a
 > >>  [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
 > >>  [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
 > >>  [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
 > >>  [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
 > >>  [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
 > >>  [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
 > >>  [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
 > >>  [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
 > >>  [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
 > >>  [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
 > >>  [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
 > >>  [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
 > >>  [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
 > >>  [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
 > >>  [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
 > >>  [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
 > >>  [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
 > >>  [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
 > >>  [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
 > >>  [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
 > >>  [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
 > >>  [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
 > >>  [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
 > >>  [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
 > >>  [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
 > >>  [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
 > >>  [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
 > >>  [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
 > >>  [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
 > >>  [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
 > >>  [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
 > >>  [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
 > >>  [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
 > >>  [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
 > >>  [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
 > >>  [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
 > >>  [<ffffffff831ccc15>] vfs_create+0x95/0xc0
 > >>  [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
 > >>  [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
 > >>  [<ffffffff831cdef9>] path_openat+0xc9/0x620
 > >>  [<ffffffff8331cc02>] ? put_dec+0x72/0x90
 > >>  [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
 > >>  [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
 > >>  [<ffffffff831bc554>] filp_open+0x44/0x60
 > >>  [<ffffffff83221862>] do_coredump+0x602/0xf60
 > >>  [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
 > >>  [<ffffffff830024d7>] do_signal+0x57/0x9d0
 > >>  [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
 > >>  [<ffffffff83730000>] ? __schedule+0x60/0x850
 > >>  [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
 > >>  [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
 > >>  [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
 > >>  [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
 > >>  [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
 > >>  [<ffffffff83736746>] retint_signal+0x46/0x90
 > > 
 > > There's nowhere in this XFS stack that disables interrupts, so I
 > > thinks it's either above or below XFS that is causing this problem.
 > 
 > any chance the stack got blown & corrupted things such that it
 > only _looks_ like irqs are disabled?

The 'good' news is I seem to be able to reproduce it fairly quickly.
I've seen it 3-4 times this afternoon. (Always from the coredump path)

I was on vacation last week, but I wasn't hitting this before then, so it's a
fairly recent change that's introduced this.

Al ?

	Dave


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  0:44       ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12  0:44 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Linus Torvalds, Linux Kernel, Al Viro, xfs

On Tue, Feb 11, 2014 at 03:49:14PM -0600, Eric Sandeen wrote:
 > On 2/11/14, 3:08 PM, Dave Chinner wrote:
 > > On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
 > >> BUG: sleeping function called from invalid context at mm/mempool.c:203
 > >> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
 > >> 5 locks held by trinity-c3/27511:
 > >>  #0:  (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
 > >>  #1:  (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
 > >>  #2:  (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
 > >>  #3:  (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
 > >>  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
 > >> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
 > >>  ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
 > >>  0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
 > >>  ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
 > >> Call Trace:
 > >>  [<ffffffff8372afea>] dump_stack+0x4e/0x7a
 > >>  [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
 > >>  [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
 > >>  [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
 > >>  [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
 > >>  [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
 > >>  [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
 > >>  [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
 > >>  [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
 > >>  [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
 > >>  [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
 > >>  [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
 > >>  [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
 > >>  [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
 > >>  [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
 > >>  [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
 > >>  [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
 > >>  [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
 > >>  [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
 > >>  [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
 > >>  [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
 > >>  [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
 > >>  [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
 > >>  [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
 > >>  [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
 > >>  [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
 > >>  [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
 > >>  [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
 > >>  [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
 > >>  [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
 > >>  [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
 > >>  [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
 > >>  [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
 > >>  [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
 > >>  [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
 > >>  [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
 > >>  [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
 > >>  [<ffffffff831ccc15>] vfs_create+0x95/0xc0
 > >>  [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
 > >>  [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
 > >>  [<ffffffff831cdef9>] path_openat+0xc9/0x620
 > >>  [<ffffffff8331cc02>] ? put_dec+0x72/0x90
 > >>  [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
 > >>  [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
 > >>  [<ffffffff831bc554>] filp_open+0x44/0x60
 > >>  [<ffffffff83221862>] do_coredump+0x602/0xf60
 > >>  [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
 > >>  [<ffffffff830024d7>] do_signal+0x57/0x9d0
 > >>  [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
 > >>  [<ffffffff83730000>] ? __schedule+0x60/0x850
 > >>  [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
 > >>  [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
 > >>  [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
 > >>  [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
 > >>  [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
 > >>  [<ffffffff83736746>] retint_signal+0x46/0x90
 > > 
 > > There's nowhere in this XFS stack that disables interrupts, so I
 > > thinks it's either above or below XFS that is causing this problem.
 > 
 > any chance the stack got blown & corrupted things such that it
 > only _looks_ like irqs are disabled?

The 'good' news is I seem to be able to reproduce it fairly quickly.
I've seen it 3-4 times this afternoon. (Always from the coredump path)

I was on vacation last week, but I wasn't hitting this before then, so it's a
fairly recent change that's introduced this.

Al ?

	Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  0:44       ` Dave Jones
@ 2014-02-12  1:09         ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12  1:09 UTC (permalink / raw)
  To: Dave Jones, Eric Sandeen, Dave Chinner, Linux Kernel, xfs,
	Linus Torvalds

On Tue, Feb 11, 2014 at 07:44:03PM -0500, Dave Jones wrote:

> The 'good' news is I seem to be able to reproduce it fairly quickly.
> I've seen it 3-4 times this afternoon. (Always from the coredump path)
> 
> I was on vacation last week, but I wasn't hitting this before then, so it's a
> fairly recent change that's introduced this.

Slap the check in vfs_create(), see if interrupts had been disabled by it or
by something in ->create().  Since it's reproducible...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  1:09         ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12  1:09 UTC (permalink / raw)
  To: Dave Jones, Eric Sandeen, Dave Chinner, Linux Kernel, xfs,
	Linus Torvalds

On Tue, Feb 11, 2014 at 07:44:03PM -0500, Dave Jones wrote:

> The 'good' news is I seem to be able to reproduce it fairly quickly.
> I've seen it 3-4 times this afternoon. (Always from the coredump path)
> 
> I was on vacation last week, but I wasn't hitting this before then, so it's a
> fairly recent change that's introduced this.

Slap the check in vfs_create(), see if interrupts had been disabled by it or
by something in ->create().  Since it's reproducible...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  1:09         ` Al Viro
@ 2014-02-12  2:52           ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12  2:52 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Eric Sandeen, Dave Chinner, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 5:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Slap the check in vfs_create(), see if interrupts had been disabled by it or
> by something in ->create().  Since it's reproducible...

path_openat() starts off with a get_empty_filp(), which allocates a
file pointer with GFP_KERNEL. So that should have triggered the
might_sleep warning if irq's were already disabled at that point.

So it's not before that - in particular, it's not in the signal
handling or do_coredump() paths.

Also, at least xfs_buf_lock() - which is much deeper in that chain -
does a down(&bp->b_sema). I'm disguested that that doesn't have a
might_sleep() in it.

Dave, mind adding a "might_sleep()" to the top of
"down[_interruptible]()". It's silly to not have coverage of semaphore
use in bad contexts.

             Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  2:52           ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12  2:52 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 5:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Slap the check in vfs_create(), see if interrupts had been disabled by it or
> by something in ->create().  Since it's reproducible...

path_openat() starts off with a get_empty_filp(), which allocates a
file pointer with GFP_KERNEL. So that should have triggered the
might_sleep warning if irq's were already disabled at that point.

So it's not before that - in particular, it's not in the signal
handling or do_coredump() paths.

Also, at least xfs_buf_lock() - which is much deeper in that chain -
does a down(&bp->b_sema). I'm disguested that that doesn't have a
might_sleep() in it.

Dave, mind adding a "might_sleep()" to the top of
"down[_interruptible]()". It's silly to not have coverage of semaphore
use in bad contexts.

             Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  2:52           ` Linus Torvalds
@ 2014-02-12  4:03             ` Dave Jones
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12  4:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Eric Sandeen, Dave Chinner, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 06:52:19PM -0800, Linus Torvalds wrote:
 > On Tue, Feb 11, 2014 at 5:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
 > >
 > > Slap the check in vfs_create(), see if interrupts had been disabled by it or
 > > by something in ->create().  Since it's reproducible...
 > 
 > path_openat() starts off with a get_empty_filp(), which allocates a
 > file pointer with GFP_KERNEL. So that should have triggered the
 > might_sleep warning if irq's were already disabled at that point.
 > 
 > So it's not before that - in particular, it's not in the signal
 > handling or do_coredump() paths.
 > 
 > Also, at least xfs_buf_lock() - which is much deeper in that chain -
 > does a down(&bp->b_sema). I'm disguested that that doesn't have a
 > might_sleep() in it.
 > 
 > Dave, mind adding a "might_sleep()" to the top of
 > "down[_interruptible]()". It's silly to not have coverage of semaphore
 > use in bad contexts.

Added those, didn't trigger. Neither did Al's suggestion.

Slightly different trace, but still from the coredump path.

	Dave

[ 3111.403822] BUG: sleeping function called from invalid context at mm/mempool.c:203
[ 3111.404414] in_atomic(): 0, irqs_disabled(): 1, pid: 19213, name: trinity-c46
[ 3111.404884] 5 locks held by trinity-c46/19213:
[ 3111.405354]  #0:  (sb_writers#9){......}, at: [<ffffffff8d2220c5>] do_coredump+0xe05/0xf60
[ 3111.405862]  #1:  (shrinker_rwsem){......}, at: [<ffffffff8d1644af>] shrink_slab+0x3f/0x180
[ 3111.406374]  #2:  (&type->s_umount_key#30){......}, at: [<ffffffff8d1c0db4>] grab_super_passive+0x44/0x90
[ 3111.406905]  #3:  (&pag->pag_ici_reclaim_lock){......}, at: [<ffffffffc031eb1a>] xfs_reclaim_inodes_ag+0x31a/0x430 [xfs]
[ 3111.407466]  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc037047f>] xfs_ilock+0x16f/0x1b0 [xfs]
[ 3111.408039] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113 
[ 3111.409779]  ffffffff8da3fde6 000000004f998871 ffff88023715cce0 ffffffff8d72b091
[ 3111.410396]  0000000000000000 ffff88023715cd08 ffffffff8d09ddb5 0000000000000010
[ 3111.411021]  ffff880243566288 0000000000000008 ffff88023715cd88 ffffffff8d1534f3
[ 3111.411656] Call Trace:
[ 3111.412283]  [<ffffffff8d72b091>] dump_stack+0x4e/0x7a
[ 3111.412922]  [<ffffffff8d09ddb5>] __might_sleep+0x105/0x150
[ 3111.413562]  [<ffffffff8d1534f3>] mempool_alloc+0xa3/0x170
[ 3111.414202]  [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
[ 3111.414855]  [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
[ 3111.415517]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.416175]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[ 3111.416843]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.417509]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
[ 3111.418175]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[ 3111.418846]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
[ 3111.419518]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
[ 3111.420195]  [<ffffffffc038a948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[ 3111.420865]  [<ffffffff8d1a6cd9>] ? __bit_spin_unlock.constprop.66+0x19/0x40
[ 3111.421551]  [<ffffffffc0389413>] _xfs_log_force_lsn+0x93/0x340 [xfs]
[ 3111.422239]  [<ffffffffc03896ee>] xfs_log_force_lsn+0x2e/0xb0 [xfs]
[ 3111.422932]  [<ffffffffc0374209>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.423625]  [<ffffffffc0370b50>] __xfs_iunpin_wait+0xd0/0x1a0 [xfs]
[ 3111.424310]  [<ffffffff8d0b9dc0>] ? autoremove_wake_function+0x40/0x40
[ 3111.425008]  [<ffffffffc0374209>] xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.425705]  [<ffffffffc031e50c>] xfs_reclaim_inode+0x8c/0x380 [xfs]
[ 3111.426405]  [<ffffffffc031ea7b>] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs]
[ 3111.427104]  [<ffffffffc031e900>] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs]
[ 3111.427797]  [<ffffffffc031ed13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 3111.428481]  [<ffffffffc032afa5>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 3111.429150]  [<ffffffff8d1c107c>] super_cache_scan+0x16c/0x180
[ 3111.429824]  [<ffffffff8d16225b>] shrink_slab_node+0x14b/0x2e0
[ 3111.430489]  [<ffffffff8d1644af>] ? shrink_slab+0x3f/0x180
[ 3111.431146]  [<ffffffff8d1644fe>] shrink_slab+0x8e/0x180
[ 3111.431796]  [<ffffffff8d1677c6>] try_to_free_pages+0x516/0x970
[ 3111.432436]  [<ffffffff8d1f2c57>] ? __set_page_dirty+0x27/0xc0
[ 3111.433065]  [<ffffffff8d15af19>] __alloc_pages_nodemask+0x7a9/0xb00
[ 3111.433689]  [<ffffffff8d19e806>] alloc_pages_current+0x106/0x1f0
[ 3111.434304]  [<ffffffff8d040cb7>] ? pte_alloc_one+0x17/0x80
[ 3111.434911]  [<ffffffff8d040cb7>] pte_alloc_one+0x17/0x80
[ 3111.435510]  [<ffffffff8d17b4a7>] __pte_alloc+0x27/0x130
[ 3111.436098]  [<ffffffff8d17effc>] handle_mm_fault+0xafc/0xbb0
[ 3111.436681]  [<ffffffff8d17f27e>] __get_user_pages+0x1ce/0x620
[ 3111.437254]  [<ffffffff8d17f984>] get_dump_page+0x54/0x80
[ 3111.437810]  [<ffffffff8d219da9>] elf_core_dump+0x12d9/0x1420
[ 3111.438356]  [<ffffffff8d219330>] ? elf_core_dump+0x860/0x1420
[ 3111.438884]  [<ffffffff8d221ec2>] do_coredump+0xc02/0xf60
[ 3111.439398]  [<ffffffff8d081358>] get_signal_to_deliver+0x2b8/0x6b0
[ 3111.439898]  [<ffffffff8d0024d7>] do_signal+0x57/0x9d0
[ 3111.440386]  [<ffffffff8d11117e>] ? __acct_update_integrals+0x8e/0x120
[ 3111.440873]  [<ffffffff8d73a29b>] ? preempt_count_sub+0x6b/0xf0
[ 3111.441357]  [<ffffffff8d735ae1>] ? _raw_spin_unlock+0x31/0x50
[ 3111.441833]  [<ffffffff8d0ab671>] ? vtime_account_user+0x91/0xa0
[ 3111.442307]  [<ffffffff8d14f59b>] ? context_tracking_user_exit+0x9b/0x100
[ 3111.442783]  [<ffffffff8d002eac>] do_notify_resume+0x5c/0xa0
[ 3111.443260]  [<ffffffff8d736806>] retint_signal+0x46/0x90
[ 3111.443773] ------------[ cut here ]------------
[ 3111.444248] WARNING: CPU: 0 PID: 19213 at block/blk.h:227 generic_make_request_checks+0x33f/0x460()
[ 3111.444742] Modules linked in: fuse can_bcm can_raw scsi_transport_iscsi ipt_ULOG nfnetlink nfc caif_socket caif af_802154 phonet af_rxrpc can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 xfs libcrc32c coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic microcode serio_raw pcspkr btusb bluetooth 6lowpan_iphc rfkill usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e ptp pps_core shpchp snd_timer snd soundcore
[ 3111.447600] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113 
[ 3111.449427]  0000000000000009 000000004f998871 ffff88023715cc90 ffffffff8d72b091
[ 3111.450062]  0000000000000000 ffff88023715ccc8 ffffffff8d06d0cd ffff8801ef3a1440
[ 3111.450697]  00000000ffffffff 00000000197e5088 0000000000800005 ffff88024091b740
[ 3111.451339] Call Trace:
[ 3111.451968]  [<ffffffff8d72b091>] dump_stack+0x4e/0x7a
[ 3111.452608]  [<ffffffff8d06d0cd>] warn_slowpath_common+0x7d/0xa0
[ 3111.453247]  [<ffffffff8d06d1fa>] warn_slowpath_null+0x1a/0x20
[ 3111.453882]  [<ffffffff8d2e821f>] generic_make_request_checks+0x33f/0x460
[ 3111.454523]  [<ffffffff8d2e8367>] generic_make_request+0x27/0x130
[ 3111.455159]  [<ffffffff8d2e84e8>] submit_bio+0x78/0x160
[ 3111.455790]  [<ffffffff8d1f9060>] ? bio_alloc_bioset+0x180/0x210
[ 3111.456432]  [<ffffffffc031432b>] _xfs_buf_ioapply+0x2bb/0x3c0 [xfs]
[ 3111.457179]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.457825]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[ 3111.458472]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.459120]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
[ 3111.459768]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[ 3111.460422]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
[ 3111.461069]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
[ 3111.461714]  [<ffffffffc038a948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[ 3111.462349]  [<ffffffff8d1a6cd9>] ? __bit_spin_unlock.constprop.66+0x19/0x40
[ 3111.463001]  [<ffffffffc0389413>] _xfs_log_force_lsn+0x93/0x340 [xfs]
[ 3111.463654]  [<ffffffffc03896ee>] xfs_log_force_lsn+0x2e/0xb0 [xfs]
[ 3111.464311]  [<ffffffffc0374209>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.464966]  [<ffffffffc0370b50>] __xfs_iunpin_wait+0xd0/0x1a0 [xfs]
[ 3111.465609]  [<ffffffff8d0b9dc0>] ? autoremove_wake_function+0x40/0x40
[ 3111.466266]  [<ffffffffc0374209>] xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.466920]  [<ffffffffc031e50c>] xfs_reclaim_inode+0x8c/0x380 [xfs]
[ 3111.467573]  [<ffffffffc031ea7b>] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs]
[ 3111.468226]  [<ffffffffc031e900>] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs]
[ 3111.468881]  [<ffffffffc031ed13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 3111.469535]  [<ffffffffc032afa5>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 3111.470182]  [<ffffffff8d1c107c>] super_cache_scan+0x16c/0x180
[ 3111.470831]  [<ffffffff8d16225b>] shrink_slab_node+0x14b/0x2e0
[ 3111.471478]  [<ffffffff8d1644af>] ? shrink_slab+0x3f/0x180
[ 3111.472121]  [<ffffffff8d1644fe>] shrink_slab+0x8e/0x180
[ 3111.472761]  [<ffffffff8d1677c6>] try_to_free_pages+0x516/0x970
[ 3111.473401]  [<ffffffff8d1f2c57>] ? __set_page_dirty+0x27/0xc0
[ 3111.474040]  [<ffffffff8d15af19>] __alloc_pages_nodemask+0x7a9/0xb00
[ 3111.474670]  [<ffffffff8d19e806>] alloc_pages_current+0x106/0x1f0
[ 3111.475285]  [<ffffffff8d040cb7>] ? pte_alloc_one+0x17/0x80
[ 3111.475879]  [<ffffffff8d040cb7>] pte_alloc_one+0x17/0x80
[ 3111.476455]  [<ffffffff8d17b4a7>] __pte_alloc+0x27/0x130
[ 3111.477012]  [<ffffffff8d17effc>] handle_mm_fault+0xafc/0xbb0
[ 3111.477549]  [<ffffffff8d17f27e>] __get_user_pages+0x1ce/0x620
[ 3111.478072]  [<ffffffff8d17f984>] get_dump_page+0x54/0x80
[ 3111.478576]  [<ffffffff8d219da9>] elf_core_dump+0x12d9/0x1420
[ 3111.479065]  [<ffffffff8d219330>] ? elf_core_dump+0x860/0x1420
[ 3111.479546]  [<ffffffff8d221ec2>] do_coredump+0xc02/0xf60
[ 3111.480021]  [<ffffffff8d081358>] get_signal_to_deliver+0x2b8/0x6b0
[ 3111.480498]  [<ffffffff8d0024d7>] do_signal+0x57/0x9d0
[ 3111.480964]  [<ffffffff8d11117e>] ? __acct_update_integrals+0x8e/0x120
[ 3111.481435]  [<ffffffff8d73a29b>] ? preempt_count_sub+0x6b/0xf0
[ 3111.481906]  [<ffffffff8d735ae1>] ? _raw_spin_unlock+0x31/0x50
[ 3111.482377]  [<ffffffff8d0ab671>] ? vtime_account_user+0x91/0xa0
[ 3111.482847]  [<ffffffff8d14f59b>] ? context_tracking_user_exit+0x9b/0x100
[ 3111.483324]  [<ffffffff8d002eac>] do_notify_resume+0x5c/0xa0
[ 3111.483800]  [<ffffffff8d736806>] retint_signal+0x46/0x90
[ 3111.484274] ---[ end trace 8cc272739908f779 ]---



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  4:03             ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12  4:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: xfs, Eric Sandeen, Al Viro, Linux Kernel

On Tue, Feb 11, 2014 at 06:52:19PM -0800, Linus Torvalds wrote:
 > On Tue, Feb 11, 2014 at 5:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
 > >
 > > Slap the check in vfs_create(), see if interrupts had been disabled by it or
 > > by something in ->create().  Since it's reproducible...
 > 
 > path_openat() starts off with a get_empty_filp(), which allocates a
 > file pointer with GFP_KERNEL. So that should have triggered the
 > might_sleep warning if irq's were already disabled at that point.
 > 
 > So it's not before that - in particular, it's not in the signal
 > handling or do_coredump() paths.
 > 
 > Also, at least xfs_buf_lock() - which is much deeper in that chain -
 > does a down(&bp->b_sema). I'm disguested that that doesn't have a
 > might_sleep() in it.
 > 
 > Dave, mind adding a "might_sleep()" to the top of
 > "down[_interruptible]()". It's silly to not have coverage of semaphore
 > use in bad contexts.

Added those, didn't trigger. Neither did Al's suggestion.

Slightly different trace, but still from the coredump path.

	Dave

[ 3111.403822] BUG: sleeping function called from invalid context at mm/mempool.c:203
[ 3111.404414] in_atomic(): 0, irqs_disabled(): 1, pid: 19213, name: trinity-c46
[ 3111.404884] 5 locks held by trinity-c46/19213:
[ 3111.405354]  #0:  (sb_writers#9){......}, at: [<ffffffff8d2220c5>] do_coredump+0xe05/0xf60
[ 3111.405862]  #1:  (shrinker_rwsem){......}, at: [<ffffffff8d1644af>] shrink_slab+0x3f/0x180
[ 3111.406374]  #2:  (&type->s_umount_key#30){......}, at: [<ffffffff8d1c0db4>] grab_super_passive+0x44/0x90
[ 3111.406905]  #3:  (&pag->pag_ici_reclaim_lock){......}, at: [<ffffffffc031eb1a>] xfs_reclaim_inodes_ag+0x31a/0x430 [xfs]
[ 3111.407466]  #4:  (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc037047f>] xfs_ilock+0x16f/0x1b0 [xfs]
[ 3111.408039] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113 
[ 3111.409779]  ffffffff8da3fde6 000000004f998871 ffff88023715cce0 ffffffff8d72b091
[ 3111.410396]  0000000000000000 ffff88023715cd08 ffffffff8d09ddb5 0000000000000010
[ 3111.411021]  ffff880243566288 0000000000000008 ffff88023715cd88 ffffffff8d1534f3
[ 3111.411656] Call Trace:
[ 3111.412283]  [<ffffffff8d72b091>] dump_stack+0x4e/0x7a
[ 3111.412922]  [<ffffffff8d09ddb5>] __might_sleep+0x105/0x150
[ 3111.413562]  [<ffffffff8d1534f3>] mempool_alloc+0xa3/0x170
[ 3111.414202]  [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
[ 3111.414855]  [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
[ 3111.415517]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.416175]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[ 3111.416843]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.417509]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
[ 3111.418175]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[ 3111.418846]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
[ 3111.419518]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
[ 3111.420195]  [<ffffffffc038a948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[ 3111.420865]  [<ffffffff8d1a6cd9>] ? __bit_spin_unlock.constprop.66+0x19/0x40
[ 3111.421551]  [<ffffffffc0389413>] _xfs_log_force_lsn+0x93/0x340 [xfs]
[ 3111.422239]  [<ffffffffc03896ee>] xfs_log_force_lsn+0x2e/0xb0 [xfs]
[ 3111.422932]  [<ffffffffc0374209>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.423625]  [<ffffffffc0370b50>] __xfs_iunpin_wait+0xd0/0x1a0 [xfs]
[ 3111.424310]  [<ffffffff8d0b9dc0>] ? autoremove_wake_function+0x40/0x40
[ 3111.425008]  [<ffffffffc0374209>] xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.425705]  [<ffffffffc031e50c>] xfs_reclaim_inode+0x8c/0x380 [xfs]
[ 3111.426405]  [<ffffffffc031ea7b>] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs]
[ 3111.427104]  [<ffffffffc031e900>] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs]
[ 3111.427797]  [<ffffffffc031ed13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 3111.428481]  [<ffffffffc032afa5>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 3111.429150]  [<ffffffff8d1c107c>] super_cache_scan+0x16c/0x180
[ 3111.429824]  [<ffffffff8d16225b>] shrink_slab_node+0x14b/0x2e0
[ 3111.430489]  [<ffffffff8d1644af>] ? shrink_slab+0x3f/0x180
[ 3111.431146]  [<ffffffff8d1644fe>] shrink_slab+0x8e/0x180
[ 3111.431796]  [<ffffffff8d1677c6>] try_to_free_pages+0x516/0x970
[ 3111.432436]  [<ffffffff8d1f2c57>] ? __set_page_dirty+0x27/0xc0
[ 3111.433065]  [<ffffffff8d15af19>] __alloc_pages_nodemask+0x7a9/0xb00
[ 3111.433689]  [<ffffffff8d19e806>] alloc_pages_current+0x106/0x1f0
[ 3111.434304]  [<ffffffff8d040cb7>] ? pte_alloc_one+0x17/0x80
[ 3111.434911]  [<ffffffff8d040cb7>] pte_alloc_one+0x17/0x80
[ 3111.435510]  [<ffffffff8d17b4a7>] __pte_alloc+0x27/0x130
[ 3111.436098]  [<ffffffff8d17effc>] handle_mm_fault+0xafc/0xbb0
[ 3111.436681]  [<ffffffff8d17f27e>] __get_user_pages+0x1ce/0x620
[ 3111.437254]  [<ffffffff8d17f984>] get_dump_page+0x54/0x80
[ 3111.437810]  [<ffffffff8d219da9>] elf_core_dump+0x12d9/0x1420
[ 3111.438356]  [<ffffffff8d219330>] ? elf_core_dump+0x860/0x1420
[ 3111.438884]  [<ffffffff8d221ec2>] do_coredump+0xc02/0xf60
[ 3111.439398]  [<ffffffff8d081358>] get_signal_to_deliver+0x2b8/0x6b0
[ 3111.439898]  [<ffffffff8d0024d7>] do_signal+0x57/0x9d0
[ 3111.440386]  [<ffffffff8d11117e>] ? __acct_update_integrals+0x8e/0x120
[ 3111.440873]  [<ffffffff8d73a29b>] ? preempt_count_sub+0x6b/0xf0
[ 3111.441357]  [<ffffffff8d735ae1>] ? _raw_spin_unlock+0x31/0x50
[ 3111.441833]  [<ffffffff8d0ab671>] ? vtime_account_user+0x91/0xa0
[ 3111.442307]  [<ffffffff8d14f59b>] ? context_tracking_user_exit+0x9b/0x100
[ 3111.442783]  [<ffffffff8d002eac>] do_notify_resume+0x5c/0xa0
[ 3111.443260]  [<ffffffff8d736806>] retint_signal+0x46/0x90
[ 3111.443773] ------------[ cut here ]------------
[ 3111.444248] WARNING: CPU: 0 PID: 19213 at block/blk.h:227 generic_make_request_checks+0x33f/0x460()
[ 3111.444742] Modules linked in: fuse can_bcm can_raw scsi_transport_iscsi ipt_ULOG nfnetlink nfc caif_socket caif af_802154 phonet af_rxrpc can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 xfs libcrc32c coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic microcode serio_raw pcspkr btusb bluetooth 6lowpan_iphc rfkill usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e ptp pps_core shpchp snd_timer snd soundcore
[ 3111.447600] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113 
[ 3111.449427]  0000000000000009 000000004f998871 ffff88023715cc90 ffffffff8d72b091
[ 3111.450062]  0000000000000000 ffff88023715ccc8 ffffffff8d06d0cd ffff8801ef3a1440
[ 3111.450697]  00000000ffffffff 00000000197e5088 0000000000800005 ffff88024091b740
[ 3111.451339] Call Trace:
[ 3111.451968]  [<ffffffff8d72b091>] dump_stack+0x4e/0x7a
[ 3111.452608]  [<ffffffff8d06d0cd>] warn_slowpath_common+0x7d/0xa0
[ 3111.453247]  [<ffffffff8d06d1fa>] warn_slowpath_null+0x1a/0x20
[ 3111.453882]  [<ffffffff8d2e821f>] generic_make_request_checks+0x33f/0x460
[ 3111.454523]  [<ffffffff8d2e8367>] generic_make_request+0x27/0x130
[ 3111.455159]  [<ffffffff8d2e84e8>] submit_bio+0x78/0x160
[ 3111.455790]  [<ffffffff8d1f9060>] ? bio_alloc_bioset+0x180/0x210
[ 3111.456432]  [<ffffffffc031432b>] _xfs_buf_ioapply+0x2bb/0x3c0 [xfs]
[ 3111.457179]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.457825]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[ 3111.458472]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.459120]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
[ 3111.459768]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[ 3111.460422]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
[ 3111.461069]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
[ 3111.461714]  [<ffffffffc038a948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[ 3111.462349]  [<ffffffff8d1a6cd9>] ? __bit_spin_unlock.constprop.66+0x19/0x40
[ 3111.463001]  [<ffffffffc0389413>] _xfs_log_force_lsn+0x93/0x340 [xfs]
[ 3111.463654]  [<ffffffffc03896ee>] xfs_log_force_lsn+0x2e/0xb0 [xfs]
[ 3111.464311]  [<ffffffffc0374209>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.464966]  [<ffffffffc0370b50>] __xfs_iunpin_wait+0xd0/0x1a0 [xfs]
[ 3111.465609]  [<ffffffff8d0b9dc0>] ? autoremove_wake_function+0x40/0x40
[ 3111.466266]  [<ffffffffc0374209>] xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.466920]  [<ffffffffc031e50c>] xfs_reclaim_inode+0x8c/0x380 [xfs]
[ 3111.467573]  [<ffffffffc031ea7b>] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs]
[ 3111.468226]  [<ffffffffc031e900>] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs]
[ 3111.468881]  [<ffffffffc031ed13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 3111.469535]  [<ffffffffc032afa5>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 3111.470182]  [<ffffffff8d1c107c>] super_cache_scan+0x16c/0x180
[ 3111.470831]  [<ffffffff8d16225b>] shrink_slab_node+0x14b/0x2e0
[ 3111.471478]  [<ffffffff8d1644af>] ? shrink_slab+0x3f/0x180
[ 3111.472121]  [<ffffffff8d1644fe>] shrink_slab+0x8e/0x180
[ 3111.472761]  [<ffffffff8d1677c6>] try_to_free_pages+0x516/0x970
[ 3111.473401]  [<ffffffff8d1f2c57>] ? __set_page_dirty+0x27/0xc0
[ 3111.474040]  [<ffffffff8d15af19>] __alloc_pages_nodemask+0x7a9/0xb00
[ 3111.474670]  [<ffffffff8d19e806>] alloc_pages_current+0x106/0x1f0
[ 3111.475285]  [<ffffffff8d040cb7>] ? pte_alloc_one+0x17/0x80
[ 3111.475879]  [<ffffffff8d040cb7>] pte_alloc_one+0x17/0x80
[ 3111.476455]  [<ffffffff8d17b4a7>] __pte_alloc+0x27/0x130
[ 3111.477012]  [<ffffffff8d17effc>] handle_mm_fault+0xafc/0xbb0
[ 3111.477549]  [<ffffffff8d17f27e>] __get_user_pages+0x1ce/0x620
[ 3111.478072]  [<ffffffff8d17f984>] get_dump_page+0x54/0x80
[ 3111.478576]  [<ffffffff8d219da9>] elf_core_dump+0x12d9/0x1420
[ 3111.479065]  [<ffffffff8d219330>] ? elf_core_dump+0x860/0x1420
[ 3111.479546]  [<ffffffff8d221ec2>] do_coredump+0xc02/0xf60
[ 3111.480021]  [<ffffffff8d081358>] get_signal_to_deliver+0x2b8/0x6b0
[ 3111.480498]  [<ffffffff8d0024d7>] do_signal+0x57/0x9d0
[ 3111.480964]  [<ffffffff8d11117e>] ? __acct_update_integrals+0x8e/0x120
[ 3111.481435]  [<ffffffff8d73a29b>] ? preempt_count_sub+0x6b/0xf0
[ 3111.481906]  [<ffffffff8d735ae1>] ? _raw_spin_unlock+0x31/0x50
[ 3111.482377]  [<ffffffff8d0ab671>] ? vtime_account_user+0x91/0xa0
[ 3111.482847]  [<ffffffff8d14f59b>] ? context_tracking_user_exit+0x9b/0x100
[ 3111.483324]  [<ffffffff8d002eac>] do_notify_resume+0x5c/0xa0
[ 3111.483800]  [<ffffffff8d736806>] retint_signal+0x46/0x90
[ 3111.484274] ---[ end trace 8cc272739908f779 ]---


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  4:03             ` Dave Jones
@ 2014-02-12  4:22               ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12  4:22 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Eric Sandeen, Dave Chinner,
	Linux Kernel, xfs

On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> [ 3111.414202]  [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
> [ 3111.414855]  [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> [ 3111.415517]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.416175]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> [ 3111.416843]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.417509]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
> [ 3111.418175]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> [ 3111.418846]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
> [ 3111.419518]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]

Very interesting.  The first thing xlog_cil_push() is doing is blocking
kmalloc().  So at that point it still hadn't been atomic.  I'd probably
slap may_sleep() in the beginning of xlog_sync() and see if that triggers...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  4:22               ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12  4:22 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Eric Sandeen, Dave Chinner,
	Linux Kernel, xfs

On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> [ 3111.414202]  [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
> [ 3111.414855]  [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> [ 3111.415517]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.416175]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> [ 3111.416843]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.417509]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
> [ 3111.418175]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> [ 3111.418846]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
> [ 3111.419518]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]

Very interesting.  The first thing xlog_cil_push() is doing is blocking
kmalloc().  So at that point it still hadn't been atomic.  I'd probably
slap may_sleep() in the beginning of xlog_sync() and see if that triggers...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  4:22               ` Al Viro
@ 2014-02-12  5:40                 ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  5:40 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 04:22:15AM +0000, Al Viro wrote:
> On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> > [ 3111.414202]  [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
> > [ 3111.414855]  [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> > [ 3111.415517]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.416175]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> > [ 3111.416843]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.417509]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
> > [ 3111.418175]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> > [ 3111.418846]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
> > [ 3111.419518]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
> 
> Very interesting.  The first thing xlog_cil_push() is doing is blocking
> kmalloc().  So at that point it still hadn't been atomic.  I'd probably
> slap may_sleep() in the beginning of xlog_sync() and see if that triggers...

None of the XFS code disables interrupts in that path, not does is
call outside XFS except to dispatch IO. The stack is pretty deep at
this point and I know that the standard (non stacked) IO stack can
consume >3kb of stack space when it gets down to having to do memory
reclaim during GFP_NOIO allocation at the lowest level of SCSI
drivers. Stack overruns typically show up with symptoms like we are
seeing.

Simple example with memory allocation follows. keep in mind that
memory reclaim uses a whole lot more stack if it is needed, and that
scheduling at this point requires about 1k of stack to be free for
the scheduler footprint, too.

FWIW, the blk-mq stuff seems to hae added 200-300 bytes of new stack
usage to the IO path....

$ sudo cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (45 entries)
        -----    ----   --------
  0)     5944      40   zone_statistics+0xbd/0xc0
  1)     5904     256   get_page_from_freelist+0x3a8/0x8a0
  2)     5648     256   __alloc_pages_nodemask+0x143/0x8e0
  3)     5392      80   alloc_pages_current+0xb2/0x170
  4)     5312      64   new_slab+0x265/0x2e0
  5)     5248     240   __slab_alloc+0x2fb/0x4c4
  6)     5008      80   __kmalloc+0x133/0x180
  7)     4928     112   virtqueue_add_sgs+0x2fe/0x520
  8)     4816     288   __virtblk_add_req+0xd5/0x180
  9)     4528      96   virtio_queue_rq+0xdd/0x1d0
 10)     4432     112   __blk_mq_run_hw_queue+0x1c3/0x3c0
 11)     4320      16   blk_mq_run_hw_queue+0x35/0x40
 12)     4304      80   blk_mq_insert_requests+0xc5/0x120
 13)     4224      96   blk_mq_flush_plug_list+0x129/0x140
 14)     4128     112   blk_flush_plug_list+0xe7/0x240
 15)     4016      32   blk_finish_plug+0x18/0x50
 16)     3984     192   _xfs_buf_ioapply+0x30f/0x3b0
 17)     3792      48   xfs_buf_iorequest+0x6f/0xc0
....
 37)      928      16   xfs_vn_create+0x13/0x20
 38)      912      64   vfs_create+0xb5/0xf0
 39)      848     208   do_last.isra.53+0x6e0/0xd00
 40)      640     176   path_openat+0xbe/0x620
 41)      464     208   do_filp_open+0x43/0xa0
 42)      256     112   do_sys_open+0x13c/0x230
 43)      144      16   SyS_open+0x22/0x30
 44)      128     128   system_call_fastpath+0x16/0x1b


Dave, before chasing ghosts, can you (like Eric originally asked)
turn on stack overrun detection?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  5:40                 ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  5:40 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Eric Sandeen, Linus Torvalds, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 04:22:15AM +0000, Al Viro wrote:
> On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> > [ 3111.414202]  [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
> > [ 3111.414855]  [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> > [ 3111.415517]  [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.416175]  [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> > [ 3111.416843]  [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.417509]  [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
> > [ 3111.418175]  [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> > [ 3111.418846]  [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
> > [ 3111.419518]  [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
> 
> Very interesting.  The first thing xlog_cil_push() is doing is blocking
> kmalloc().  So at that point it still hadn't been atomic.  I'd probably
> slap may_sleep() in the beginning of xlog_sync() and see if that triggers...

None of the XFS code disables interrupts in that path, not does is
call outside XFS except to dispatch IO. The stack is pretty deep at
this point and I know that the standard (non stacked) IO stack can
consume >3kb of stack space when it gets down to having to do memory
reclaim during GFP_NOIO allocation at the lowest level of SCSI
drivers. Stack overruns typically show up with symptoms like we are
seeing.

Simple example with memory allocation follows. keep in mind that
memory reclaim uses a whole lot more stack if it is needed, and that
scheduling at this point requires about 1k of stack to be free for
the scheduler footprint, too.

FWIW, the blk-mq stuff seems to hae added 200-300 bytes of new stack
usage to the IO path....

$ sudo cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (45 entries)
        -----    ----   --------
  0)     5944      40   zone_statistics+0xbd/0xc0
  1)     5904     256   get_page_from_freelist+0x3a8/0x8a0
  2)     5648     256   __alloc_pages_nodemask+0x143/0x8e0
  3)     5392      80   alloc_pages_current+0xb2/0x170
  4)     5312      64   new_slab+0x265/0x2e0
  5)     5248     240   __slab_alloc+0x2fb/0x4c4
  6)     5008      80   __kmalloc+0x133/0x180
  7)     4928     112   virtqueue_add_sgs+0x2fe/0x520
  8)     4816     288   __virtblk_add_req+0xd5/0x180
  9)     4528      96   virtio_queue_rq+0xdd/0x1d0
 10)     4432     112   __blk_mq_run_hw_queue+0x1c3/0x3c0
 11)     4320      16   blk_mq_run_hw_queue+0x35/0x40
 12)     4304      80   blk_mq_insert_requests+0xc5/0x120
 13)     4224      96   blk_mq_flush_plug_list+0x129/0x140
 14)     4128     112   blk_flush_plug_list+0xe7/0x240
 15)     4016      32   blk_finish_plug+0x18/0x50
 16)     3984     192   _xfs_buf_ioapply+0x30f/0x3b0
 17)     3792      48   xfs_buf_iorequest+0x6f/0xc0
....
 37)      928      16   xfs_vn_create+0x13/0x20
 38)      912      64   vfs_create+0xb5/0xf0
 39)      848     208   do_last.isra.53+0x6e0/0xd00
 40)      640     176   path_openat+0xbe/0x620
 41)      464     208   do_filp_open+0x43/0xa0
 42)      256     112   do_sys_open+0x13c/0x230
 43)      144      16   SyS_open+0x22/0x30
 44)      128     128   system_call_fastpath+0x16/0x1b


Dave, before chasing ghosts, can you (like Eric originally asked)
turn on stack overrun detection?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  5:40                 ` Dave Chinner
@ 2014-02-12  5:50                   ` Dave Jones
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12  5:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:

 > None of the XFS code disables interrupts in that path, not does is
 > call outside XFS except to dispatch IO. The stack is pretty deep at
 > this point and I know that the standard (non stacked) IO stack can
 > consume >3kb of stack space when it gets down to having to do memory
 > reclaim during GFP_NOIO allocation at the lowest level of SCSI
 > drivers. Stack overruns typically show up with symptoms like we are
 > seeing.
 > ..
 > 
 > Dave, before chasing ghosts, can you (like Eric originally asked)
 > turn on stack overrun detection?

CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

	Dave


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  5:50                   ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12  5:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs, Linus Torvalds, Al Viro, Linux Kernel

On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:

 > None of the XFS code disables interrupts in that path, not does is
 > call outside XFS except to dispatch IO. The stack is pretty deep at
 > this point and I know that the standard (non stacked) IO stack can
 > consume >3kb of stack space when it gets down to having to do memory
 > reclaim during GFP_NOIO allocation at the lowest level of SCSI
 > drivers. Stack overruns typically show up with symptoms like we are
 > seeing.
 > ..
 > 
 > Dave, before chasing ghosts, can you (like Eric originally asked)
 > turn on stack overrun detection?

CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

	Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  5:50                   ` Dave Jones
@ 2014-02-12  6:10                     ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  6:10 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> 
>  > None of the XFS code disables interrupts in that path, not does is
>  > call outside XFS except to dispatch IO. The stack is pretty deep at
>  > this point and I know that the standard (non stacked) IO stack can
>  > consume >3kb of stack space when it gets down to having to do memory
>  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>  > drivers. Stack overruns typically show up with symptoms like we are
>  > seeing.
>  > ..
>  > 
>  > Dave, before chasing ghosts, can you (like Eric originally asked)
>  > turn on stack overrun detection?
> 
> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

That only checks stack usage when an interrupt is taken. If no
interrupts are taken when stack usage is within 128 bytes of
overflow, then it doesn't catch it.

I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
stack usage of a process via canary overwrites and it records it in
do_exit(). I also use the stack tracer to record the largest stack
usage seen so I know exactly what code paths are approaching stack
overruns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  6:10                     ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  6:10 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> 
>  > None of the XFS code disables interrupts in that path, not does is
>  > call outside XFS except to dispatch IO. The stack is pretty deep at
>  > this point and I know that the standard (non stacked) IO stack can
>  > consume >3kb of stack space when it gets down to having to do memory
>  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>  > drivers. Stack overruns typically show up with symptoms like we are
>  > seeing.
>  > ..
>  > 
>  > Dave, before chasing ghosts, can you (like Eric originally asked)
>  > turn on stack overrun detection?
> 
> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

That only checks stack usage when an interrupt is taken. If no
interrupts are taken when stack usage is within 128 bytes of
overflow, then it doesn't catch it.

I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
stack usage of a process via canary overwrites and it records it in
do_exit(). I also use the stack tracer to record the largest stack
usage seen so I know exactly what code paths are approaching stack
overruns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  5:40                 ` Dave Chinner
@ 2014-02-12  6:28                   ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12  6:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> None of the XFS code disables interrupts in that path, not does is
> call outside XFS except to dispatch IO. The stack is pretty deep at
> this point and I know that the standard (non stacked) IO stack can
> consume >3kb of stack space when it gets down to having to do memory
> reclaim during GFP_NOIO allocation at the lowest level of SCSI
> drivers. Stack overruns typically show up with symptoms like we are
> seeing.

That would also explain why it shows up for do_coredump(), even though
clearly interrupts are not disabled at that point. It's just because
do_coredump() opens a filename at a deeper point in the stack than the
more normal system call paths do.

It looks like just "do_signal()" has a stack frame that is about 230
bytes even under normal circumstancs (largely due to "struct ksignal"
- which in turn is largely due to the insane 128-byte padding in
siginfo_t). Add a few other frames in there, and I guess that if it
was close before, the coredump path just makes it go off.

And some of the debug options that I'm sure DaveJ has enabled tend to
make the stack usage worse (simply because they make a lot of data
structures bigger).

             Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  6:28                   ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12  6:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, xfs, Eric Sandeen, Al Viro, Linux Kernel

On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> None of the XFS code disables interrupts in that path, not does is
> call outside XFS except to dispatch IO. The stack is pretty deep at
> this point and I know that the standard (non stacked) IO stack can
> consume >3kb of stack space when it gets down to having to do memory
> reclaim during GFP_NOIO allocation at the lowest level of SCSI
> drivers. Stack overruns typically show up with symptoms like we are
> seeing.

That would also explain why it shows up for do_coredump(), even though
clearly interrupts are not disabled at that point. It's just because
do_coredump() opens a filename at a deeper point in the stack than the
more normal system call paths do.

It looks like just "do_signal()" has a stack frame that is about 230
bytes even under normal circumstancs (largely due to "struct ksignal"
- which in turn is largely due to the insane 128-byte padding in
siginfo_t). Add a few other frames in there, and I guess that if it
was close before, the coredump path just makes it go off.

And some of the debug options that I'm sure DaveJ has enabled tend to
make the stack usage worse (simply because they make a lot of data
structures bigger).

             Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:10                     ` Dave Chinner
@ 2014-02-12  6:31                       ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  6:31 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> > 
> >  > None of the XFS code disables interrupts in that path, not does is
> >  > call outside XFS except to dispatch IO. The stack is pretty deep at
> >  > this point and I know that the standard (non stacked) IO stack can
> >  > consume >3kb of stack space when it gets down to having to do memory
> >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> >  > drivers. Stack overruns typically show up with symptoms like we are
> >  > seeing.
> >  > ..
> >  > 
> >  > Dave, before chasing ghosts, can you (like Eric originally asked)
> >  > turn on stack overrun detection?
> > 
> > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
> 
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
> 
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...

FYI, just creating lots of files with open(O_CREAT):

[  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
[  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
[  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
[  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
[  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
[  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
[  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left

We've got absolutely no spare stack space anymore in the IO path.
And the IO path can't get much simpler than filesystem -> virtio
block device.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  6:31                       ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  6:31 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> > 
> >  > None of the XFS code disables interrupts in that path, not does is
> >  > call outside XFS except to dispatch IO. The stack is pretty deep at
> >  > this point and I know that the standard (non stacked) IO stack can
> >  > consume >3kb of stack space when it gets down to having to do memory
> >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> >  > drivers. Stack overruns typically show up with symptoms like we are
> >  > seeing.
> >  > ..
> >  > 
> >  > Dave, before chasing ghosts, can you (like Eric originally asked)
> >  > turn on stack overrun detection?
> > 
> > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
> 
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
> 
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...

FYI, just creating lots of files with open(O_CREAT):

[  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
[  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
[  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
[  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
[  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
[  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
[  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left

We've got absolutely no spare stack space anymore in the IO path.
And the IO path can't get much simpler than filesystem -> virtio
block device.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:31                       ` Dave Chinner
@ 2014-02-12  6:59                         ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12  6:59 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe, Tejun Heo, Steven Rostedt
  Cc: Dave Jones, Al Viro, Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> FYI, just creating lots of files with open(O_CREAT):
>
> [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> [  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
>
> We've got absolutely no spare stack space anymore in the IO path.
> And the IO path can't get much simpler than filesystem -> virtio
> block device.

Ugh, that's bad. A thousand bytes of stack space is much too close to
any limits.

Do you have the stack traces for these things so that we can look at
worst offenders?

If the new block-mq code is to blame, it needs to be fixed.
__virtblk_add_req() has a 300-byte stack frame, it seems. Looking
elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
hopefully not in any normal path - online discard is moronic, and I'm
assuming XFS doesn't do that.

There's a lot of 200+ byte stack frames in block/blk-core.s, and they
all seem to be of the type perf_trace_block_buffer() - things created
with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
frame, I have no idea. That sounds like a potential disaster too,
although hopefully it's mostly leaf functions - but leaf functions
*deep* in the callchain. Tejun? Steven, why _do_ they end up with such
huge frames?

And if the stack use comes from the VFS layer, we can probably work on
that too. But I don't think that has really changed much lately..

                Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  6:59                         ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12  6:59 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe, Tejun Heo, Steven Rostedt
  Cc: Dave Jones, xfs, Eric Sandeen, Al Viro, Linux Kernel

On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> FYI, just creating lots of files with open(O_CREAT):
>
> [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> [  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
>
> We've got absolutely no spare stack space anymore in the IO path.
> And the IO path can't get much simpler than filesystem -> virtio
> block device.

Ugh, that's bad. A thousand bytes of stack space is much too close to
any limits.

Do you have the stack traces for these things so that we can look at
worst offenders?

If the new block-mq code is to blame, it needs to be fixed.
__virtblk_add_req() has a 300-byte stack frame, it seems. Looking
elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
hopefully not in any normal path - online discard is moronic, and I'm
assuming XFS doesn't do that.

There's a lot of 200+ byte stack frames in block/blk-core.s, and they
all seem to be of the type perf_trace_block_buffer() - things created
with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
frame, I have no idea. That sounds like a potential disaster too,
although hopefully it's mostly leaf functions - but leaf functions
*deep* in the callchain. Tejun? Steven, why _do_ they end up with such
huge frames?

And if the stack use comes from the VFS layer, we can probably work on
that too. But I don't think that has really changed much lately..

                Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:28                   ` Linus Torvalds
@ 2014-02-12  7:18                     ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  7:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > None of the XFS code disables interrupts in that path, not does is
> > call outside XFS except to dispatch IO. The stack is pretty deep at
> > this point and I know that the standard (non stacked) IO stack can
> > consume >3kb of stack space when it gets down to having to do memory
> > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > drivers. Stack overruns typically show up with symptoms like we are
> > seeing.
> 
> That would also explain why it shows up for do_coredump(), even though
> clearly interrupts are not disabled at that point. It's just because
> do_coredump() opens a filename at a deeper point in the stack than the
> more normal system call paths do.

Right, it's exactly the same problem we have when knfsd is making
the VFS calls.

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

Yup. But it's when you see this sort of thing you realise that
almost no GFP_KERNEL memory allocation is really safe from stack
overruns, though:

  0)     6064      64   update_group_power+0x2c/0x270
  1)     6000     384   find_busiest_group+0x10a/0x8b0
  2)     5616     288   load_balance+0x165/0x870
  3)     5328      96   idle_balance+0x106/0x1b0
  4)     5232     112   __schedule+0x7b6/0x7e0
  5)     5120      16   schedule+0x29/0x70
  6)     5104     176   percpu_ida_alloc+0x1b3/0x3d0
  7)     4928      32   blk_mq_wait_for_tags+0x1f/0x40
  8)     4896      80   blk_mq_alloc_request_pinned+0x4e/0x110
  9)     4816     128   blk_mq_make_request+0x42b/0x510
 10)     4688      48   generic_make_request+0xc0/0x110
 11)     4640      96   submit_bio+0x71/0x120
 12)     4544     192   _xfs_buf_ioapply+0x2cc/0x3b0
 13)     4352      48   xfs_buf_iorequest+0x6f/0xc0
 14)     4304      32   xlog_bdstrat+0x23/0x60
 15)     4272      96   xlog_sync+0x3a4/0x5c0
 16)     4176      48   xlog_state_release_iclog+0x121/0x130
 17)     4128     192   xlog_write+0x700/0x7c0
 18)     3936     192   xlog_cil_push+0x2ae/0x3d0
 19)     3744     128   xlog_cil_force_lsn+0x1b8/0x1e0
 20)     3616     144   _xfs_log_force_lsn+0x7c/0x300
 21)     3472      48   xfs_log_force_lsn+0x3b/0xa0
 22)     3424     112   xfs_iunpin_wait+0xd7/0x160
 23)     3312      80   xfs_reclaim_inode+0xd0/0x350
 24)     3232     432   xfs_reclaim_inodes_ag+0x277/0x3d0
 25)     2800      48   xfs_reclaim_inodes_nr+0x33/0x40
 26)     2752      16   xfs_fs_free_cached_objects+0x15/0x20
 27)     2736      80   super_cache_scan+0x169/0x170
 28)     2656     160   shrink_slab_node+0x133/0x290
 29)     2496      80   shrink_slab+0x122/0x160
 30)     2416     112   do_try_to_free_pages+0x1de/0x360
 31)     2304     192   try_to_free_pages+0x110/0x190
 32)     2112     256   __alloc_pages_nodemask+0x5a2/0x8e0
 33)     1856      80   alloc_pages_current+0xb2/0x170
 34)     1776      64   new_slab+0x265/0x2e0
 35)     1712     240   __slab_alloc+0x2fb/0x4c4
 36)     1472      80   kmem_cache_alloc+0x10b/0x140

That's almost 4700 bytes of stack usage from the
kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
Yes, it's an extreme case, but if you're looking for a smoking
gun.... :/

I can fix this one easily - we already have a workqueue for doing
async log pushes (will split the stack between xlog_cil_force_lsn
and xlog_cil_push), but the reason we haven't used it for synchronous
log forces is that screws up fsync performance on CFQ. We don't
recommend CFQ with XFS anyway, so I think I'll make this change
anyway.

> And some of the debug options that I'm sure DaveJ has enabled tend to
> make the stack usage worse (simply because they make a lot of data
> structures bigger).

True - CONFIG_XFS_DEBUG tends to add about 5% to the stack usage of
XFS, but realistically 5% is not significant especially as we've
been hitting stack overflows with XFS on production systems
regularly enough on x86-64 over the past 2-3 years that we know what
"typical symptoms" of such overflows are...

The problem we have now is that everything outside XFS continues to
grow in stack usage, so the only choice that remains for us to avoid
overruns is to add performance impacting stack switches into the
middle of common XFS paths. We also have to do it unconditionally
because we don't know ahead of time if any specific operation is
going to require all the stack we can give it because - for example
- we can't predict when the IO path is going to require memory
allocation...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  7:18                     ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  7:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, xfs, Eric Sandeen, Al Viro, Linux Kernel

On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > None of the XFS code disables interrupts in that path, not does is
> > call outside XFS except to dispatch IO. The stack is pretty deep at
> > this point and I know that the standard (non stacked) IO stack can
> > consume >3kb of stack space when it gets down to having to do memory
> > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > drivers. Stack overruns typically show up with symptoms like we are
> > seeing.
> 
> That would also explain why it shows up for do_coredump(), even though
> clearly interrupts are not disabled at that point. It's just because
> do_coredump() opens a filename at a deeper point in the stack than the
> more normal system call paths do.

Right, it's exactly the same problem we have when knfsd is making
the VFS calls.

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

Yup. But it's when you see this sort of thing you realise that
almost no GFP_KERNEL memory allocation is really safe from stack
overruns, though:

  0)     6064      64   update_group_power+0x2c/0x270
  1)     6000     384   find_busiest_group+0x10a/0x8b0
  2)     5616     288   load_balance+0x165/0x870
  3)     5328      96   idle_balance+0x106/0x1b0
  4)     5232     112   __schedule+0x7b6/0x7e0
  5)     5120      16   schedule+0x29/0x70
  6)     5104     176   percpu_ida_alloc+0x1b3/0x3d0
  7)     4928      32   blk_mq_wait_for_tags+0x1f/0x40
  8)     4896      80   blk_mq_alloc_request_pinned+0x4e/0x110
  9)     4816     128   blk_mq_make_request+0x42b/0x510
 10)     4688      48   generic_make_request+0xc0/0x110
 11)     4640      96   submit_bio+0x71/0x120
 12)     4544     192   _xfs_buf_ioapply+0x2cc/0x3b0
 13)     4352      48   xfs_buf_iorequest+0x6f/0xc0
 14)     4304      32   xlog_bdstrat+0x23/0x60
 15)     4272      96   xlog_sync+0x3a4/0x5c0
 16)     4176      48   xlog_state_release_iclog+0x121/0x130
 17)     4128     192   xlog_write+0x700/0x7c0
 18)     3936     192   xlog_cil_push+0x2ae/0x3d0
 19)     3744     128   xlog_cil_force_lsn+0x1b8/0x1e0
 20)     3616     144   _xfs_log_force_lsn+0x7c/0x300
 21)     3472      48   xfs_log_force_lsn+0x3b/0xa0
 22)     3424     112   xfs_iunpin_wait+0xd7/0x160
 23)     3312      80   xfs_reclaim_inode+0xd0/0x350
 24)     3232     432   xfs_reclaim_inodes_ag+0x277/0x3d0
 25)     2800      48   xfs_reclaim_inodes_nr+0x33/0x40
 26)     2752      16   xfs_fs_free_cached_objects+0x15/0x20
 27)     2736      80   super_cache_scan+0x169/0x170
 28)     2656     160   shrink_slab_node+0x133/0x290
 29)     2496      80   shrink_slab+0x122/0x160
 30)     2416     112   do_try_to_free_pages+0x1de/0x360
 31)     2304     192   try_to_free_pages+0x110/0x190
 32)     2112     256   __alloc_pages_nodemask+0x5a2/0x8e0
 33)     1856      80   alloc_pages_current+0xb2/0x170
 34)     1776      64   new_slab+0x265/0x2e0
 35)     1712     240   __slab_alloc+0x2fb/0x4c4
 36)     1472      80   kmem_cache_alloc+0x10b/0x140

That's almost 4700 bytes of stack usage from the
kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
Yes, it's an extreme case, but if you're looking for a smoking
gun.... :/

I can fix this one easily - we already have a workqueue for doing
async log pushes (will split the stack between xlog_cil_force_lsn
and xlog_cil_push), but the reason we haven't used it for synchronous
log forces is that screws up fsync performance on CFQ. We don't
recommend CFQ with XFS anyway, so I think I'll make this change
anyway.

> And some of the debug options that I'm sure DaveJ has enabled tend to
> make the stack usage worse (simply because they make a lot of data
> structures bigger).

True - CONFIG_XFS_DEBUG tends to add about 5% to the stack usage of
XFS, but realistically 5% is not significant especially as we've
been hitting stack overflows with XFS on production systems
regularly enough on x86-64 over the past 2-3 years that we know what
"typical symptoms" of such overflows are...

The problem we have now is that everything outside XFS continues to
grow in stack usage, so the only choice that remains for us to avoid
overruns is to add performance impacting stack switches into the
middle of common XFS paths. We also have to do it unconditionally
because we don't know ahead of time if any specific operation is
going to require all the stack we can give it because - for example
- we can't predict when the IO path is going to require memory
allocation...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:59                         ` Linus Torvalds
@ 2014-02-12  8:13                           ` Tejun Heo
  -1 siblings, 0 replies; 132+ messages in thread
From: Tejun Heo @ 2014-02-12  8:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jens Axboe, Steven Rostedt, Dave Jones, Al Viro,
	Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

It looks like they're essentially the same for all the automatically
generated trace functions.  I'm seeing 232 byte stack frame in most of
them.  If I'm not completely confused by these macros, these are
generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
contains struct pt_regs in the stack frame which is already 168 bytes,
so that seems like the culprit.  No idea whether this is something
avoidable.  At least they shouldn't nest in any way.  Steven?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  8:13                           ` Tejun Heo
  0 siblings, 0 replies; 132+ messages in thread
From: Tejun Heo @ 2014-02-12  8:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Eric Sandeen, Linux Kernel, Steven Rostedt, xfs,
	Al Viro, Dave Jones

On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

It looks like they're essentially the same for all the automatically
generated trace functions.  I'm seeing 232 byte stack frame in most of
them.  If I'm not completely confused by these macros, these are
generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
contains struct pt_regs in the stack frame which is already 168 bytes,
so that seems like the culprit.  No idea whether this is something
avoidable.  At least they shouldn't nest in any way.  Steven?

Thanks.

-- 
tejun

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:59                         ` Linus Torvalds
@ 2014-02-12  8:35                           ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  8:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Tejun Heo, Steven Rostedt, Dave Jones, Al Viro,
	Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
> 
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
> 
> Do you have the stack traces for these things so that we can look at
> worst offenders?

Sure. It's almost through the XFS block allocation path where we do
metadata reads or log writes. That's always been the worst case
stack usage path for XFS. The worst path is about 4800 bytes - it
was a double btree split (inode extent map tree split, followed by a
freespace btree split when allocating a block for the extent map
btree split), triggering a reallocation of a btree block that was
pinned and STALE, which triggered a log force, which then made it's
way into the block layer.

I don't have a stack trace of that because I can't create the
initial conditions required to trigger it on demand. I've only ever
seen it in the wild once. More common, though, is the situation we
see here - somewhere around 3-4k of stack usage from a single extent
btree update operation.

We've already got a stack switch in the data allocation path - we
had to add it to stop the bdi-flusher from busting the stack with
alarming regularity in production systems. I initially made the
metadata allocation paths use it as well, but that part got reverted
(aa29284 xfs: don't defer metadata allocation to the workqueue)
because Mel Gorman demonstrated several workloads that regressed
significantly as a result of making that change.

So now we are stuck betweeen a rock and a hard place - those
metadata block allocation paths are triggering the stack overflows,
but switching stacks in the allocation path to avoid stack overflows
causes massive performance regressions....

The thing is, the XFS stack usage has not changed significantly over
the past 5 years - we keep a pretty close eye on it. The issue is
that everything around XFS has slowly been growing and I can't
point you at one stack trace that demonstrates the worst case. What
I can point you to function calls that consume a lot of stack space
and hence limit what is available to callers.

So here's a few stack fragements I've generated in teh past few
minutes on a couple of XFS filesystems inside a 16p/16GB VM by
running:

$ dd if=/dev/zero of=/mnt/test/foo bs=1024k &

which writes at about 600MB/s to a 18TB RAID array. That runs in
parallel with a multithreaded inode creation/walk/remove workload on
a different filesystem (creates 50 million inodes ~250k inode
creates/s, walks them removes them at around 170k removes/s) which
is hosted on a dm RAID0 stripe of 2 samsung 840 EVO SSDs.

mutex_lock() requires at least 1.2k of stack because of the
scheduler overhead.:

$ sudo cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (39 entries)
        -----    ----   --------
  0)     4368      64   update_group_power+0x2c/0x270
  1)     4304     384   find_busiest_group+0x10a/0x8b0
  2)     3920     288   load_balance+0x165/0x870
  3)     3632      96   idle_balance+0x106/0x1b0
  4)     3536     112   __schedule+0x7b6/0x7e0
  5)     3424      16   schedule+0x29/0x70
  6)     3408      16   schedule_preempt_disabled+0xe/0x10
  7)     3392     112   __mutex_lock_slowpath+0x11a/0x400
  8)     3280      32   mutex_lock+0x1e/0x32

i.e. any function that is going to schedule needs at least 1k of
stack, and some of the lead in  infrastructure (like
wait_for_completion) can require a total of up to 1.5k...

Basic memory allocation can easily >1k of stack without
entering reclaim:

        -----    ----   --------
  0)     4920      40   zone_statistics+0xbd/0xc0
  1)     4880     256   get_page_from_freelist+0x3a8/0x8a0
  2)     4624     256   __alloc_pages_nodemask+0x143/0x8e0
  3)     4368      80   alloc_pages_current+0xb2/0x170
  4)     4288      64   new_slab+0x265/0x2e0
  5)     4224     240   __slab_alloc+0x2fb/0x4c4
  6)     3984      80   kmem_cache_alloc+0x10b/0x140
  7)     3904      48   kmem_zone_alloc+0x77/0x100

If it enters reclaim and we are allowed to IO? Well, you saw the
stack I posted in the other thread - 4700 bytes from
kmem_cache_alloc() to the top of the stack. Another bad case is
the swap path (using a scsi device rather than virtio for the
root/swap devices):

        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     4496      16   mempool_alloc_slab+0x15/0x20
  1)     4480     128   mempool_alloc+0x66/0x170
  2)     4352      16   scsi_sg_alloc+0x4e/0x60
  3)     4336     112   __sg_alloc_table+0x68/0x130
  4)     4224      32   scsi_init_sgtable+0x34/0x90
  5)     4192      48   scsi_init_io+0x34/0xd0
  6)     4144      32   scsi_setup_fs_cmnd+0x66/0xa0
  7)     4112     112   sd_prep_fn+0x2a0/0xce0
  8)     4000      48   blk_peek_request+0x13c/0x260
  9)     3952     112   scsi_request_fn+0x4b/0x490
 10)     3840      32   __blk_run_queue+0x37/0x50
 11)     3808      64   queue_unplugged+0x39/0xb0
 12)     3744     112   blk_flush_plug_list+0x20b/0x240
 13)     3632      80   blk_queue_bio+0x1ca/0x310
 14)     3552      48   generic_make_request+0xc0/0x110
 15)     3504      96   submit_bio+0x71/0x120
 16)     3408     160   __swap_writepage+0x184/0x220
 17)     3248      32   swap_writepage+0x42/0x90
 18)     3216     304   shrink_page_list+0x6fd/0xa20
 19)     2912     208   shrink_inactive_list+0x243/0x480
 20)     2704     288   shrink_lruvec+0x371/0x670
 21)     2416     112   do_try_to_free_pages+0x11a/0x360
 22)     2304     192   try_to_free_pages+0x110/0x190
 23)     2112     256   __alloc_pages_nodemask+0x5a2/0x8e0
 24)     1856      80   alloc_pages_current+0xb2/0x170
 25)     1776      64   new_slab+0x265/0x2e0
 26)     1712     240   __slab_alloc+0x2fb/0x4c4
 27)     1472      80   kmem_cache_alloc+0x10b/0x140
 28)     1392      48   kmem_zone_alloc+0x77/0x100

There's over 3k in the stack from kmem_zone_alloc(GFP_KERNEL) and
that's got nothing XFS in it at all - it's just memory allocation,
reclaim and Io path. And it also demonstrates that the scsi layer
has significant stack usage.

here's 2.5k from submit_bio() on virtio block device:

        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     4512      64   update_curr+0x8b/0x160
  1)     4448      96   enqueue_task_fair+0x39d/0xd50
  2)     4352      48   enqueue_task+0x50/0x60
  3)     4304      16   activate_task+0x23/0x30
  4)     4288      32   ttwu_do_activate.constprop.84+0x3c/0x70
  5)     4256      96   try_to_wake_up+0x1b4/0x2c0
  6)     4160      16   default_wake_function+0x12/0x20
  7)     4144      32   autoremove_wake_function+0x16/0x40
  8)     4112      80   __wake_up_common+0x55/0x90
  9)     4032      64   __wake_up+0x48/0x70
 10)     3968      80   wakeup_kswapd+0xe7/0x130
 11)     3888     256   __alloc_pages_nodemask+0x371/0x8e0
 12)     3632      80   alloc_pages_current+0xb2/0x170
 13)     3552      64   new_slab+0x265/0x2e0
 14)     3488     240   __slab_alloc+0x2fb/0x4c4
 15)     3248      80   __kmalloc+0x133/0x180
 16)     3168     112   virtqueue_add_sgs+0x2fe/0x520
 17)     3056     288   __virtblk_add_req+0xd5/0x180
 18)     2768      96   virtio_queue_rq+0xdd/0x1d0
 19)     2672     112   __blk_mq_run_hw_queue+0x1c3/0x3c0
 20)     2560      16   blk_mq_run_hw_queue+0x35/0x40
 21)     2544      80   blk_mq_insert_requests+0xc5/0x120
 22)     2464      96   blk_mq_flush_plug_list+0x129/0x140
 23)     2368     112   blk_flush_plug_list+0xe7/0x240
 24)     2256     128   blk_mq_make_request+0x3ca/0x510
 25)     2128      48   generic_make_request+0xc0/0x110
 26)     2080      96   submit_bio+0x71/0x120

You can also see the difference in stack usage that the blk_mq layer
adds compared to the previous trace that went through the old
code to get to the SCSI stack.

And you saw the memory reclaim via shrinker path taht I posted in
another email - that was 4700 bytes from kmem_cache_alloc() to teh
top of the stack (can be reduced to about 2000 bytes by having XFS
chop it in half).

> If the new block-mq code is to blame, it needs to be fixed.

It's a symptom of the pattern of gradual growth in just about every
core subsystem. Each individual change is not significant, but when
you put the whole stack together the canary is well and truly dead.

> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.

We *support* it, but we don't *recommend* it. So, no, that's not the
problem path. ;)

> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch
on x86-64 (at least, according to the documentation).
CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more
stack usage than the tracer. As such, I also think that
CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of
stack usage because it is canary based and so captures the very
worst case usage of the process's stack...

It really seems to me that we have got to the point where it is not
safe to do anything like submit IO or try to do memory allocation
with any more than half the stack space consumed....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12  8:35                           ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12  8:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Eric Sandeen, Linux Kernel, Steven Rostedt, xfs,
	Al Viro, Tejun Heo, Dave Jones

On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
> 
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
> 
> Do you have the stack traces for these things so that we can look at
> worst offenders?

Sure. It's almost through the XFS block allocation path where we do
metadata reads or log writes. That's always been the worst case
stack usage path for XFS. The worst path is about 4800 bytes - it
was a double btree split (inode extent map tree split, followed by a
freespace btree split when allocating a block for the extent map
btree split), triggering a reallocation of a btree block that was
pinned and STALE, which triggered a log force, which then made it's
way into the block layer.

I don't have a stack trace of that because I can't create the
initial conditions required to trigger it on demand. I've only ever
seen it in the wild once. More common, though, is the situation we
see here - somewhere around 3-4k of stack usage from a single extent
btree update operation.

We've already got a stack switch in the data allocation path - we
had to add it to stop the bdi-flusher from busting the stack with
alarming regularity in production systems. I initially made the
metadata allocation paths use it as well, but that part got reverted
(aa29284 xfs: don't defer metadata allocation to the workqueue)
because Mel Gorman demonstrated several workloads that regressed
significantly as a result of making that change.

So now we are stuck betweeen a rock and a hard place - those
metadata block allocation paths are triggering the stack overflows,
but switching stacks in the allocation path to avoid stack overflows
causes massive performance regressions....

The thing is, the XFS stack usage has not changed significantly over
the past 5 years - we keep a pretty close eye on it. The issue is
that everything around XFS has slowly been growing and I can't
point you at one stack trace that demonstrates the worst case. What
I can point you to function calls that consume a lot of stack space
and hence limit what is available to callers.

So here's a few stack fragements I've generated in teh past few
minutes on a couple of XFS filesystems inside a 16p/16GB VM by
running:

$ dd if=/dev/zero of=/mnt/test/foo bs=1024k &

which writes at about 600MB/s to a 18TB RAID array. That runs in
parallel with a multithreaded inode creation/walk/remove workload on
a different filesystem (creates 50 million inodes ~250k inode
creates/s, walks them removes them at around 170k removes/s) which
is hosted on a dm RAID0 stripe of 2 samsung 840 EVO SSDs.

mutex_lock() requires at least 1.2k of stack because of the
scheduler overhead.:

$ sudo cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (39 entries)
        -----    ----   --------
  0)     4368      64   update_group_power+0x2c/0x270
  1)     4304     384   find_busiest_group+0x10a/0x8b0
  2)     3920     288   load_balance+0x165/0x870
  3)     3632      96   idle_balance+0x106/0x1b0
  4)     3536     112   __schedule+0x7b6/0x7e0
  5)     3424      16   schedule+0x29/0x70
  6)     3408      16   schedule_preempt_disabled+0xe/0x10
  7)     3392     112   __mutex_lock_slowpath+0x11a/0x400
  8)     3280      32   mutex_lock+0x1e/0x32

i.e. any function that is going to schedule needs at least 1k of
stack, and some of the lead in  infrastructure (like
wait_for_completion) can require a total of up to 1.5k...

Basic memory allocation can easily >1k of stack without
entering reclaim:

        -----    ----   --------
  0)     4920      40   zone_statistics+0xbd/0xc0
  1)     4880     256   get_page_from_freelist+0x3a8/0x8a0
  2)     4624     256   __alloc_pages_nodemask+0x143/0x8e0
  3)     4368      80   alloc_pages_current+0xb2/0x170
  4)     4288      64   new_slab+0x265/0x2e0
  5)     4224     240   __slab_alloc+0x2fb/0x4c4
  6)     3984      80   kmem_cache_alloc+0x10b/0x140
  7)     3904      48   kmem_zone_alloc+0x77/0x100

If it enters reclaim and we are allowed to IO? Well, you saw the
stack I posted in the other thread - 4700 bytes from
kmem_cache_alloc() to the top of the stack. Another bad case is
the swap path (using a scsi device rather than virtio for the
root/swap devices):

        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     4496      16   mempool_alloc_slab+0x15/0x20
  1)     4480     128   mempool_alloc+0x66/0x170
  2)     4352      16   scsi_sg_alloc+0x4e/0x60
  3)     4336     112   __sg_alloc_table+0x68/0x130
  4)     4224      32   scsi_init_sgtable+0x34/0x90
  5)     4192      48   scsi_init_io+0x34/0xd0
  6)     4144      32   scsi_setup_fs_cmnd+0x66/0xa0
  7)     4112     112   sd_prep_fn+0x2a0/0xce0
  8)     4000      48   blk_peek_request+0x13c/0x260
  9)     3952     112   scsi_request_fn+0x4b/0x490
 10)     3840      32   __blk_run_queue+0x37/0x50
 11)     3808      64   queue_unplugged+0x39/0xb0
 12)     3744     112   blk_flush_plug_list+0x20b/0x240
 13)     3632      80   blk_queue_bio+0x1ca/0x310
 14)     3552      48   generic_make_request+0xc0/0x110
 15)     3504      96   submit_bio+0x71/0x120
 16)     3408     160   __swap_writepage+0x184/0x220
 17)     3248      32   swap_writepage+0x42/0x90
 18)     3216     304   shrink_page_list+0x6fd/0xa20
 19)     2912     208   shrink_inactive_list+0x243/0x480
 20)     2704     288   shrink_lruvec+0x371/0x670
 21)     2416     112   do_try_to_free_pages+0x11a/0x360
 22)     2304     192   try_to_free_pages+0x110/0x190
 23)     2112     256   __alloc_pages_nodemask+0x5a2/0x8e0
 24)     1856      80   alloc_pages_current+0xb2/0x170
 25)     1776      64   new_slab+0x265/0x2e0
 26)     1712     240   __slab_alloc+0x2fb/0x4c4
 27)     1472      80   kmem_cache_alloc+0x10b/0x140
 28)     1392      48   kmem_zone_alloc+0x77/0x100

There's over 3k in the stack from kmem_zone_alloc(GFP_KERNEL) and
that's got nothing XFS in it at all - it's just memory allocation,
reclaim and Io path. And it also demonstrates that the scsi layer
has significant stack usage.

here's 2.5k from submit_bio() on virtio block device:

        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     4512      64   update_curr+0x8b/0x160
  1)     4448      96   enqueue_task_fair+0x39d/0xd50
  2)     4352      48   enqueue_task+0x50/0x60
  3)     4304      16   activate_task+0x23/0x30
  4)     4288      32   ttwu_do_activate.constprop.84+0x3c/0x70
  5)     4256      96   try_to_wake_up+0x1b4/0x2c0
  6)     4160      16   default_wake_function+0x12/0x20
  7)     4144      32   autoremove_wake_function+0x16/0x40
  8)     4112      80   __wake_up_common+0x55/0x90
  9)     4032      64   __wake_up+0x48/0x70
 10)     3968      80   wakeup_kswapd+0xe7/0x130
 11)     3888     256   __alloc_pages_nodemask+0x371/0x8e0
 12)     3632      80   alloc_pages_current+0xb2/0x170
 13)     3552      64   new_slab+0x265/0x2e0
 14)     3488     240   __slab_alloc+0x2fb/0x4c4
 15)     3248      80   __kmalloc+0x133/0x180
 16)     3168     112   virtqueue_add_sgs+0x2fe/0x520
 17)     3056     288   __virtblk_add_req+0xd5/0x180
 18)     2768      96   virtio_queue_rq+0xdd/0x1d0
 19)     2672     112   __blk_mq_run_hw_queue+0x1c3/0x3c0
 20)     2560      16   blk_mq_run_hw_queue+0x35/0x40
 21)     2544      80   blk_mq_insert_requests+0xc5/0x120
 22)     2464      96   blk_mq_flush_plug_list+0x129/0x140
 23)     2368     112   blk_flush_plug_list+0xe7/0x240
 24)     2256     128   blk_mq_make_request+0x3ca/0x510
 25)     2128      48   generic_make_request+0xc0/0x110
 26)     2080      96   submit_bio+0x71/0x120

You can also see the difference in stack usage that the blk_mq layer
adds compared to the previous trace that went through the old
code to get to the SCSI stack.

And you saw the memory reclaim via shrinker path taht I posted in
another email - that was 4700 bytes from kmem_cache_alloc() to teh
top of the stack (can be reduced to about 2000 bytes by having XFS
chop it in half).

> If the new block-mq code is to blame, it needs to be fixed.

It's a symptom of the pattern of gradual growth in just about every
core subsystem. Each individual change is not significant, but when
you put the whole stack together the canary is well and truly dead.

> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.

We *support* it, but we don't *recommend* it. So, no, that's not the
problem path. ;)

> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch
on x86-64 (at least, according to the documentation).
CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more
stack usage than the tracer. As such, I also think that
CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of
stack usage because it is canary based and so captures the very
worst case usage of the process's stack...

It really seems to me that we have got to the point where it is not
safe to do anything like submit IO or try to do memory allocation
with any more than half the stack space consumed....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:28                   ` Linus Torvalds
@ 2014-02-12 11:39                     ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12 11:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

We could, in principle, put it into task_struct and make get_signal()
return its address - do_signal() is called only in the code that does
assorted returns to userland...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 11:39                     ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12 11:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

We could, in principle, put it into task_struct and make get_signal()
return its address - do_signal() is called only in the code that does
assorted returns to userland...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:59                         ` Linus Torvalds
@ 2014-02-12 12:40                           ` Steven Rostedt
  -1 siblings, 0 replies; 132+ messages in thread
From: Steven Rostedt @ 2014-02-12 12:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jens Axboe, Tejun Heo, Dave Jones, Al Viro,
	Eric Sandeen, Linux Kernel, xfs, Ingo Molnar, Peter Zijlstra,
	Frederic Weisbecker

[ Added the perf tracepoint maintainers ]

On Tue, 11 Feb 2014 22:59:58 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
> 
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
> 
> Do you have the stack traces for these things so that we can look at
> worst offenders?
> 
> If the new block-mq code is to blame, it needs to be fixed.
> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.
> 
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

The perf_trace_##event is defined in include/trace/ftrace.h.

There we have this:

perf_trace_##call(void *__data, proto)					\
{									\
	struct ftrace_event_call *event_call = __data;			\
	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
	struct ftrace_raw_##call *entry;				\
	struct pt_regs __regs;						\
	u64 __addr = 0, __count = 1;					\
	struct task_struct *__task = NULL;				\
	struct hlist_head *head;					\
	int __entry_size;						\
	int __data_size;						\
	int rctx;							\
									\

Mostly pointers except for two structures. The __data_offests, is
dynamically defined, and only consists of values from the tracepoint
entry_structure that defines dynamic arrays. But the other structure on
the stack looks a bit harrier. The pt_regs structure.

That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
culprit here.

Peter and Frederic, is there a way not to store that on the stack?

-- Steve

> 
> And if the stack use comes from the VFS layer, we can probably work on
> that too. But I don't think that has really changed much lately..
> 
>                 Linus


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 12:40                           ` Steven Rostedt
  0 siblings, 0 replies; 132+ messages in thread
From: Steven Rostedt @ 2014-02-12 12:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Peter Zijlstra, Eric Sandeen, Linux Kernel, xfs,
	Al Viro, Frederic Weisbecker, Tejun Heo, Dave Jones, Ingo Molnar

[ Added the perf tracepoint maintainers ]

On Tue, 11 Feb 2014 22:59:58 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [  349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
> 
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
> 
> Do you have the stack traces for these things so that we can look at
> worst offenders?
> 
> If the new block-mq code is to blame, it needs to be fixed.
> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.
> 
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

The perf_trace_##event is defined in include/trace/ftrace.h.

There we have this:

perf_trace_##call(void *__data, proto)					\
{									\
	struct ftrace_event_call *event_call = __data;			\
	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
	struct ftrace_raw_##call *entry;				\
	struct pt_regs __regs;						\
	u64 __addr = 0, __count = 1;					\
	struct task_struct *__task = NULL;				\
	struct hlist_head *head;					\
	int __entry_size;						\
	int __data_size;						\
	int rctx;							\
									\

Mostly pointers except for two structures. The __data_offests, is
dynamically defined, and only consists of values from the tracepoint
entry_structure that defines dynamic arrays. But the other structure on
the stack looks a bit harrier. The pt_regs structure.

That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
culprit here.

Peter and Frederic, is there a way not to store that on the stack?

-- Steve

> 
> And if the stack use comes from the VFS layer, we can probably work on
> that too. But I don't think that has really changed much lately..
> 
>                 Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  8:13                           ` Tejun Heo
@ 2014-02-12 12:44                             ` Steven Rostedt
  -1 siblings, 0 replies; 132+ messages in thread
From: Steven Rostedt @ 2014-02-12 12:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Dave Jones, Al Viro,
	Eric Sandeen, Linux Kernel, xfs, Ingo Molnar, Peter Zijlstra,
	Frederic Weisbecker

On Wed, 12 Feb 2014 03:13:33 -0500
Tejun Heo <tj@kernel.org> wrote:

> On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> > There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> > all seem to be of the type perf_trace_block_buffer() - things created
> > with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> > frame, I have no idea. That sounds like a potential disaster too,
> > although hopefully it's mostly leaf functions - but leaf functions
> > *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> > huge frames?
> 
> It looks like they're essentially the same for all the automatically
> generated trace functions.  I'm seeing 232 byte stack frame in most of
> them.  If I'm not completely confused by these macros, these are
> generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
> contains struct pt_regs in the stack frame which is already 168 bytes,
> so that seems like the culprit.  No idea whether this is something
> avoidable.  At least they shouldn't nest in any way.  Steven?

They shouldn't nest. But if the perf tracepoint is active, I don't know
how much more of the stack is used in the functions that tracepoint
calls.

-- Steve

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 12:44                             ` Steven Rostedt
  0 siblings, 0 replies; 132+ messages in thread
From: Steven Rostedt @ 2014-02-12 12:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Peter Zijlstra, Eric Sandeen, Linux Kernel, xfs,
	Al Viro, Frederic Weisbecker, Dave Jones, Linus Torvalds,
	Ingo Molnar

On Wed, 12 Feb 2014 03:13:33 -0500
Tejun Heo <tj@kernel.org> wrote:

> On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> > There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> > all seem to be of the type perf_trace_block_buffer() - things created
> > with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> > frame, I have no idea. That sounds like a potential disaster too,
> > although hopefully it's mostly leaf functions - but leaf functions
> > *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> > huge frames?
> 
> It looks like they're essentially the same for all the automatically
> generated trace functions.  I'm seeing 232 byte stack frame in most of
> them.  If I'm not completely confused by these macros, these are
> generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
> contains struct pt_regs in the stack frame which is already 168 bytes,
> so that seems like the culprit.  No idea whether this is something
> avoidable.  At least they shouldn't nest in any way.  Steven?

They shouldn't nest. But if the perf tracepoint is active, I don't know
how much more of the stack is used in the functions that tracepoint
calls.

-- Steve

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  8:35                           ` Dave Chinner
@ 2014-02-12 12:50                             ` Steven Rostedt
  -1 siblings, 0 replies; 132+ messages in thread
From: Steven Rostedt @ 2014-02-12 12:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jens Axboe, Tejun Heo, Dave Jones, Al Viro,
	Eric Sandeen, Linux Kernel, xfs

On Wed, 12 Feb 2014 19:35:13 +1100
Dave Chinner <david@fromorbit.com> wrote:


> And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch
> on x86-64 (at least, according to the documentation).
> CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more
> stack usage than the tracer. As such, I also think that
> CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of
> stack usage because it is canary based and so captures the very
> worst case usage of the process's stack...

Yeah, with the new fentry (adding the mcount call before setting up the
stack frame), the function tracer can not catch leaf functions, as it
is called before the leaf function's frame is set up.

Hmm, I wonder if I should add a config to disable fentry and go back to
the old mcount that gets called after setting up the stack frame. This
will lead to better stack tracing, but you lose out on all the benefits
of fentry.

-- Steve

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 12:50                             ` Steven Rostedt
  0 siblings, 0 replies; 132+ messages in thread
From: Steven Rostedt @ 2014-02-12 12:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Eric Sandeen, Linux Kernel, xfs, Al Viro, Tejun Heo,
	Dave Jones, Linus Torvalds

On Wed, 12 Feb 2014 19:35:13 +1100
Dave Chinner <david@fromorbit.com> wrote:


> And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch
> on x86-64 (at least, according to the documentation).
> CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more
> stack usage than the tracer. As such, I also think that
> CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of
> stack usage because it is canary based and so captures the very
> worst case usage of the process's stack...

Yeah, with the new fentry (adding the mcount call before setting up the
stack frame), the function tracer can not catch leaf functions, as it
is called before the leaf function's frame is set up.

Hmm, I wonder if I should add a config to disable fentry and go back to
the old mcount that gets called after setting up the stack frame. This
will lead to better stack tracing, but you lose out on all the benefits
of fentry.

-- Steve

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 12:40                           ` Steven Rostedt
@ 2014-02-12 13:29                             ` Peter Zijlstra
  -1 siblings, 0 replies; 132+ messages in thread
From: Peter Zijlstra @ 2014-02-12 13:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Tejun Heo, Dave Jones,
	Al Viro, Eric Sandeen, Linux Kernel, xfs, Ingo Molnar,
	Frederic Weisbecker

On Wed, Feb 12, 2014 at 07:40:36AM -0500, Steven Rostedt wrote:
> The pt_regs structure.
> 
> That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
> culprit here.
> 
> Peter and Frederic, is there a way not to store that on the stack?

Something like so?

---
 include/trace/ftrace.h          | 7 ++++---
 kernel/trace/trace_event_perf.c | 5 ++++-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 1a8b28db3775..87ae3ef1d278 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -678,7 +678,7 @@ perf_trace_##call(void *__data, proto)					\
 	struct ftrace_event_call *event_call = __data;			\
 	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
 	struct ftrace_raw_##call *entry;				\
-	struct pt_regs __regs;						\
+	struct pt_regs *__regs;						\
 	u64 __addr = 0, __count = 1;					\
 	struct task_struct *__task = NULL;				\
 	struct hlist_head *head;					\
@@ -697,18 +697,19 @@ perf_trace_##call(void *__data, proto)					\
 			     sizeof(u64));				\
 	__entry_size -= sizeof(u32);					\
 									\
-	perf_fetch_caller_regs(&__regs);				\
 	entry = perf_trace_buf_prepare(__entry_size,			\
 			event_call->event.type, &__regs, &rctx);	\
 	if (!entry)							\
 		return;							\
 									\
+	perf_fetch_caller_regs(__regs);					\
+									\
 	tstruct								\
 									\
 	{ assign; }							\
 									\
 	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
-		__count, &__regs, head, __task);			\
+		__count, __regs, head, __task);				\
 }
 
 /*
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index e854f420e033..1885f4aac109 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -232,8 +232,10 @@ void perf_trace_del(struct perf_event *p_event, int flags)
 	tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
 }
 
+static DEFINE_PER_CPU(struct pt_regs, tp_regs[4]);
+
 __kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
-				       struct pt_regs *regs, int *rctxp)
+				       struct pt_regs **regs, int *rctxp)
 {
 	struct trace_entry *entry;
 	unsigned long flags;
@@ -252,6 +254,7 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
 	if (*rctxp < 0)
 		return NULL;
 
+	*regs = this_cpu_ptr(&tp_regs[*rctxp]);
 	raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
 
 	/* zero the dead bytes from align to not leak stack to user */


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 13:29                             ` Peter Zijlstra
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Zijlstra @ 2014-02-12 13:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jens Axboe, Eric Sandeen, Linux Kernel, xfs, Al Viro,
	Frederic Weisbecker, Tejun Heo, Dave Jones, Linus Torvalds,
	Ingo Molnar

On Wed, Feb 12, 2014 at 07:40:36AM -0500, Steven Rostedt wrote:
> The pt_regs structure.
> 
> That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
> culprit here.
> 
> Peter and Frederic, is there a way not to store that on the stack?

Something like so?

---
 include/trace/ftrace.h          | 7 ++++---
 kernel/trace/trace_event_perf.c | 5 ++++-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 1a8b28db3775..87ae3ef1d278 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -678,7 +678,7 @@ perf_trace_##call(void *__data, proto)					\
 	struct ftrace_event_call *event_call = __data;			\
 	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
 	struct ftrace_raw_##call *entry;				\
-	struct pt_regs __regs;						\
+	struct pt_regs *__regs;						\
 	u64 __addr = 0, __count = 1;					\
 	struct task_struct *__task = NULL;				\
 	struct hlist_head *head;					\
@@ -697,18 +697,19 @@ perf_trace_##call(void *__data, proto)					\
 			     sizeof(u64));				\
 	__entry_size -= sizeof(u32);					\
 									\
-	perf_fetch_caller_regs(&__regs);				\
 	entry = perf_trace_buf_prepare(__entry_size,			\
 			event_call->event.type, &__regs, &rctx);	\
 	if (!entry)							\
 		return;							\
 									\
+	perf_fetch_caller_regs(__regs);					\
+									\
 	tstruct								\
 									\
 	{ assign; }							\
 									\
 	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
-		__count, &__regs, head, __task);			\
+		__count, __regs, head, __task);				\
 }
 
 /*
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index e854f420e033..1885f4aac109 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -232,8 +232,10 @@ void perf_trace_del(struct perf_event *p_event, int flags)
 	tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
 }
 
+static DEFINE_PER_CPU(struct pt_regs, tp_regs[4]);
+
 __kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
-				       struct pt_regs *regs, int *rctxp)
+				       struct pt_regs **regs, int *rctxp)
 {
 	struct trace_entry *entry;
 	unsigned long flags;
@@ -252,6 +254,7 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
 	if (*rctxp < 0)
 		return NULL;
 
+	*regs = this_cpu_ptr(&tp_regs[*rctxp]);
 	raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
 
 	/* zero the dead bytes from align to not leak stack to user */

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:10                     ` Dave Chinner
@ 2014-02-12 14:25                       ` Dave Jones
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12 14:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
 > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
 > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
 > > 
 > >  > None of the XFS code disables interrupts in that path, not does is
 > >  > call outside XFS except to dispatch IO. The stack is pretty deep at
 > >  > this point and I know that the standard (non stacked) IO stack can
 > >  > consume >3kb of stack space when it gets down to having to do memory
 > >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
 > >  > drivers. Stack overruns typically show up with symptoms like we are
 > >  > seeing.
 > >  > ..
 > >  > 
 > >  > Dave, before chasing ghosts, can you (like Eric originally asked)
 > >  > turn on stack overrun detection?
 > > 
 > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
 > 
 > That only checks stack usage when an interrupt is taken. If no
 > interrupts are taken when stack usage is within 128 bytes of
 > overflow, then it doesn't catch it.
 > 
 > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
 > stack usage of a process via canary overwrites and it records it in
 > do_exit().

I had that on too. The only message from it came from quite a while
before the trace that happened overnight..

[ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
[12900.804230] BUG: sleeping function called from invalid context at mm/mempool.c:203

 > I also use the stack tracer to record the largest stack
 > usage seen so I know exactly what code paths are approaching stack
 > overruns...

I can give that a try later.

	Dave


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 14:25                       ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-12 14:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs, Linus Torvalds, Al Viro, Linux Kernel

On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
 > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
 > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
 > > 
 > >  > None of the XFS code disables interrupts in that path, not does is
 > >  > call outside XFS except to dispatch IO. The stack is pretty deep at
 > >  > this point and I know that the standard (non stacked) IO stack can
 > >  > consume >3kb of stack space when it gets down to having to do memory
 > >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
 > >  > drivers. Stack overruns typically show up with symptoms like we are
 > >  > seeing.
 > >  > ..
 > >  > 
 > >  > Dave, before chasing ghosts, can you (like Eric originally asked)
 > >  > turn on stack overrun detection?
 > > 
 > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
 > 
 > That only checks stack usage when an interrupt is taken. If no
 > interrupts are taken when stack usage is within 128 bytes of
 > overflow, then it doesn't catch it.
 > 
 > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
 > stack usage of a process via canary overwrites and it records it in
 > do_exit().

I had that on too. The only message from it came from quite a while
before the trace that happened overnight..

[ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
[12900.804230] BUG: sleeping function called from invalid context at mm/mempool.c:203

 > I also use the stack tracer to record the largest stack
 > usage seen so I know exactly what code paths are approaching stack
 > overruns...

I can give that a try later.

	Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  6:10                     ` Dave Chinner
@ 2014-02-12 15:57                       ` Eric Sandeen
  -1 siblings, 0 replies; 132+ messages in thread
From: Eric Sandeen @ 2014-02-12 15:57 UTC (permalink / raw)
  To: Dave Chinner, Dave Jones, Al Viro, Linus Torvalds, Linux Kernel, xfs

On 2/12/14, 12:10 AM, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
>> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>>
>>  > None of the XFS code disables interrupts in that path, not does is
>>  > call outside XFS except to dispatch IO. The stack is pretty deep at
>>  > this point and I know that the standard (non stacked) IO stack can
>>  > consume >3kb of stack space when it gets down to having to do memory
>>  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>>  > drivers. Stack overruns typically show up with symptoms like we are
>>  > seeing.
>>  > ..
>>  > 
>>  > Dave, before chasing ghosts, can you (like Eric originally asked)
>>  > turn on stack overrun detection?
>>
>> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
> 
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
> 
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...
> 
> Cheers,
> 
> Dave.
> 


I'm not sure if I'm off base here, but maybe this would make sense: check
for a corrupted stack in __might_sleep.  Compile tested only,
possibly inelegant, and/or completely wrong, but:


From: Eric Sandeen <sandeen@redhat.com>

sched: Test for corrupted task_struct in __might_sleep

If a thread overruns the stack, it may corrupt the task_struct,
leading to false positives on tests like irqs_disabled().

Warn if this seems to be the case.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b46131e..6920c3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6934,6 +6934,8 @@ static inline int preempt_count_equals(int preempt_offset)
 
 void __might_sleep(const char *file, int line, int preempt_offset)
 {
+	struct task_struct *tsk = current;
+	unsigned long *stackend;
 	static unsigned long prev_jiffy;	/* ratelimiting */
 
 	rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
@@ -6952,6 +6954,11 @@ void __might_sleep(const char *file, int line, int preempt_offset)
 			in_atomic(), irqs_disabled(),
 			current->pid, current->comm);
 
+	/* A corrupted stack can cause a false positive on irqs_disabled etc */
+	stackend = end_of_stack(tsk);
+	if (tsk != &init_task && *stackend != STACK_END_MAGIC)
+		printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
+	
 	debug_show_held_locks(current);
 	if (irqs_disabled())
 		print_irqtrace_events(current);



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 15:57                       ` Eric Sandeen
  0 siblings, 0 replies; 132+ messages in thread
From: Eric Sandeen @ 2014-02-12 15:57 UTC (permalink / raw)
  To: Dave Chinner, Dave Jones, Al Viro, Linus Torvalds, Linux Kernel, xfs

On 2/12/14, 12:10 AM, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
>> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>>
>>  > None of the XFS code disables interrupts in that path, not does is
>>  > call outside XFS except to dispatch IO. The stack is pretty deep at
>>  > this point and I know that the standard (non stacked) IO stack can
>>  > consume >3kb of stack space when it gets down to having to do memory
>>  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>>  > drivers. Stack overruns typically show up with symptoms like we are
>>  > seeing.
>>  > ..
>>  > 
>>  > Dave, before chasing ghosts, can you (like Eric originally asked)
>>  > turn on stack overrun detection?
>>
>> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
> 
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
> 
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...
> 
> Cheers,
> 
> Dave.
> 


I'm not sure if I'm off base here, but maybe this would make sense: check
for a corrupted stack in __might_sleep.  Compile tested only,
possibly inelegant, and/or completely wrong, but:


From: Eric Sandeen <sandeen@redhat.com>

sched: Test for corrupted task_struct in __might_sleep

If a thread overruns the stack, it may corrupt the task_struct,
leading to false positives on tests like irqs_disabled().

Warn if this seems to be the case.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b46131e..6920c3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6934,6 +6934,8 @@ static inline int preempt_count_equals(int preempt_offset)
 
 void __might_sleep(const char *file, int line, int preempt_offset)
 {
+	struct task_struct *tsk = current;
+	unsigned long *stackend;
 	static unsigned long prev_jiffy;	/* ratelimiting */
 
 	rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
@@ -6952,6 +6954,11 @@ void __might_sleep(const char *file, int line, int preempt_offset)
 			in_atomic(), irqs_disabled(),
 			current->pid, current->comm);
 
+	/* A corrupted stack can cause a false positive on irqs_disabled etc */
+	stackend = end_of_stack(tsk);
+	if (tsk != &init_task && *stackend != STACK_END_MAGIC)
+		printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
+	
 	debug_show_held_locks(current);
 	if (irqs_disabled())
 		print_irqtrace_events(current);


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 11:39                     ` Al Viro
@ 2014-02-12 20:13                       ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12 20:13 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 3:39 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
>
>> It looks like just "do_signal()" has a stack frame that is about 230
>> bytes even under normal circumstancs (largely due to "struct ksignal"
>> - which in turn is largely due to the insane 128-byte padding in
>> siginfo_t). Add a few other frames in there, and I guess that if it
>> was close before, the coredump path just makes it go off.
>
> We could, in principle, put it into task_struct and make get_signal()
> return its address - do_signal() is called only in the code that does
> assorted returns to userland...

We have better uses for random buffers in "struct task_struct", I'd
hate to put a siginfo_t there.

The thing is, siginfo_t has that idiotic 128-byte area, but it's all
"for future expansion". I think it's some damn glibc disease - we've
seen these kinds of insane paddings before.

The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.

Sad.

                Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 20:13                       ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12 20:13 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 3:39 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
>
>> It looks like just "do_signal()" has a stack frame that is about 230
>> bytes even under normal circumstancs (largely due to "struct ksignal"
>> - which in turn is largely due to the insane 128-byte padding in
>> siginfo_t). Add a few other frames in there, and I guess that if it
>> was close before, the coredump path just makes it go off.
>
> We could, in principle, put it into task_struct and make get_signal()
> return its address - do_signal() is called only in the code that does
> assorted returns to userland...

We have better uses for random buffers in "struct task_struct", I'd
hate to put a siginfo_t there.

The thing is, siginfo_t has that idiotic 128-byte area, but it's all
"for future expansion". I think it's some damn glibc disease - we've
seen these kinds of insane paddings before.

The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.

Sad.

                Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 20:13                       ` Linus Torvalds
@ 2014-02-12 21:14                         ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12 21:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 12:13:19PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 3:39 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> >
> >> It looks like just "do_signal()" has a stack frame that is about 230
> >> bytes even under normal circumstancs (largely due to "struct ksignal"
> >> - which in turn is largely due to the insane 128-byte padding in
> >> siginfo_t). Add a few other frames in there, and I guess that if it
> >> was close before, the coredump path just makes it go off.
> >
> > We could, in principle, put it into task_struct and make get_signal()
> > return its address - do_signal() is called only in the code that does
> > assorted returns to userland...
> 
> We have better uses for random buffers in "struct task_struct", I'd
> hate to put a siginfo_t there.

*nod*

> The thing is, siginfo_t has that idiotic 128-byte area, but it's all
> "for future expansion". I think it's some damn glibc disease - we've
> seen these kinds of insane paddings before.
> 
> The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.
> 
> Sad.

Umm...  What if we delay __sigqueue_free()?  After all, that's where the
fat sucker normally comes from.  That way we might get away with much
smaller structure on stack...

Just introduce a small structure that would contain signr, uid, pid and
pointer to struct sigqueue.  And pass a pointer to _that_ all the way down
to collect_signal().  Pointer's NULL == it's SI_USER with signr/uid/pid
from the small struct and all other fields are zero.  Pointer isn't NULL -
use &small_struct->p->info.  And have struct sigqueue actually freed
via task_work_add() in that case.

Do you see any fundamental problems with that?  Looks like it would be
faster as well - less copying involved...



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 21:14                         ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12 21:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 12:13:19PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 3:39 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> >
> >> It looks like just "do_signal()" has a stack frame that is about 230
> >> bytes even under normal circumstancs (largely due to "struct ksignal"
> >> - which in turn is largely due to the insane 128-byte padding in
> >> siginfo_t). Add a few other frames in there, and I guess that if it
> >> was close before, the coredump path just makes it go off.
> >
> > We could, in principle, put it into task_struct and make get_signal()
> > return its address - do_signal() is called only in the code that does
> > assorted returns to userland...
> 
> We have better uses for random buffers in "struct task_struct", I'd
> hate to put a siginfo_t there.

*nod*

> The thing is, siginfo_t has that idiotic 128-byte area, but it's all
> "for future expansion". I think it's some damn glibc disease - we've
> seen these kinds of insane paddings before.
> 
> The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.
> 
> Sad.

Umm...  What if we delay __sigqueue_free()?  After all, that's where the
fat sucker normally comes from.  That way we might get away with much
smaller structure on stack...

Just introduce a small structure that would contain signr, uid, pid and
pointer to struct sigqueue.  And pass a pointer to _that_ all the way down
to collect_signal().  Pointer's NULL == it's SI_USER with signr/uid/pid
from the small struct and all other fields are zero.  Pointer isn't NULL -
use &small_struct->p->info.  And have struct sigqueue actually freed
via task_work_add() in that case.

Do you see any fundamental problems with that?  Looks like it would be
faster as well - less copying involved...


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 14:25                       ` Dave Jones
@ 2014-02-12 21:14                         ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12 21:14 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 09:25:38AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
>  > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
>  > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>  > > 
>  > >  > None of the XFS code disables interrupts in that path, not does is
>  > >  > call outside XFS except to dispatch IO. The stack is pretty deep at
>  > >  > this point and I know that the standard (non stacked) IO stack can
>  > >  > consume >3kb of stack space when it gets down to having to do memory
>  > >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>  > >  > drivers. Stack overruns typically show up with symptoms like we are
>  > >  > seeing.
>  > >  > ..
>  > >  > 
>  > >  > Dave, before chasing ghosts, can you (like Eric originally asked)
>  > >  > turn on stack overrun detection?
>  > > 
>  > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
>  > 
>  > That only checks stack usage when an interrupt is taken. If no
>  > interrupts are taken when stack usage is within 128 bytes of
>  > overflow, then it doesn't catch it.
>  > 
>  > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
>  > stack usage of a process via canary overwrites and it records it in
>  > do_exit().
> 
> I had that on too. The only message from it came from quite a while
> before the trace that happened overnight..

Right, it won't capture an overrun at the point in time an overrun
occurs, either, because it only checks when the process exits. But
it does tell you what stack usage is being seen, as this:

> [ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
> [12900.804230] BUG: sleeping function called from invalid context at mm/mempool.c:203

is a pretty a good indication that trinity is at risk of stack
overuns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 21:14                         ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-12 21:14 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 09:25:38AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
>  > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
>  > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>  > > 
>  > >  > None of the XFS code disables interrupts in that path, not does is
>  > >  > call outside XFS except to dispatch IO. The stack is pretty deep at
>  > >  > this point and I know that the standard (non stacked) IO stack can
>  > >  > consume >3kb of stack space when it gets down to having to do memory
>  > >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>  > >  > drivers. Stack overruns typically show up with symptoms like we are
>  > >  > seeing.
>  > >  > ..
>  > >  > 
>  > >  > Dave, before chasing ghosts, can you (like Eric originally asked)
>  > >  > turn on stack overrun detection?
>  > > 
>  > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
>  > 
>  > That only checks stack usage when an interrupt is taken. If no
>  > interrupts are taken when stack usage is within 128 bytes of
>  > overflow, then it doesn't catch it.
>  > 
>  > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
>  > stack usage of a process via canary overwrites and it records it in
>  > do_exit().
> 
> I had that on too. The only message from it came from quite a while
> before the trace that happened overnight..

Right, it won't capture an overrun at the point in time an overrun
occurs, either, because it only checks when the process exits. But
it does tell you what stack usage is being seen, as this:

> [ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
> [12900.804230] BUG: sleeping function called from invalid context at mm/mempool.c:203

is a pretty a good indication that trinity is at risk of stack
overuns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 21:14                         ` Al Viro
@ 2014-02-12 21:32                           ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12 21:32 UTC (permalink / raw)
  To: Al Viro, Oleg Nesterov
  Cc: Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> fat sucker normally comes from.  That way we might get away with much
> smaller structure on stack...

Sounds like the RightThing(tm) to do to me, and I don't see why it
wouldn't work.

We'd have to teach each user of "dequeue_signal()" to free the siginfo
thing. Which shouldn't be too bad - I think we've collected all of
that into generic code, and there isn't the mass or architecture code
that knows about these things any more. But there are a few odd
drivers etc and signalfd. I didn't look at what the lifetimes were.

Adding Oleg to the cc, since any time we touch any of that code, he
should be involved. Oleg - the issue is the biggish size of 'struct
ksignal' on stack, brought on by the silly "put a whole siginfo_t in
it".

            Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 21:32                           ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-12 21:32 UTC (permalink / raw)
  To: Al Viro, Oleg Nesterov; +Cc: Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> fat sucker normally comes from.  That way we might get away with much
> smaller structure on stack...

Sounds like the RightThing(tm) to do to me, and I don't see why it
wouldn't work.

We'd have to teach each user of "dequeue_signal()" to free the siginfo
thing. Which shouldn't be too bad - I think we've collected all of
that into generic code, and there isn't the mass or architecture code
that knows about these things any more. But there are a few odd
drivers etc and signalfd. I didn't look at what the lifetimes were.

Adding Oleg to the cc, since any time we touch any of that code, he
should be involved. Oleg - the issue is the biggish size of 'struct
ksignal' on stack, brought on by the silly "put a whole siginfo_t in
it".

            Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 21:32                           ` Linus Torvalds
@ 2014-02-12 21:44                             ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12 21:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> > fat sucker normally comes from.  That way we might get away with much
> > smaller structure on stack...
> 
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.
> 
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd. I didn't look at what the lifetimes were.

Only signalfd, AFAICS.  And there we'd want to use the same small structure -
it's used in
        do {
                ret = signalfd_dequeue(ctx, &info, nonblock);
                if (unlikely(ret <= 0))
                        break;
                ret = signalfd_copyinfo(siginfo, &info);
                if (ret < 0)
                        break;
                siginfo++;
                total += ret;
                nonblock = 1;
        } while (--count);
and using a smaller struct would actually speed the things up - skips one
copying.  sigqueue would be freed as soon as we'd done signalfd_copyinfo()
(if not by signalfd_copyinfo() itself).

I'll try to put something along those lines together, if you or Oleg don't
do it first.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-12 21:44                             ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-12 21:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric Sandeen, Oleg Nesterov, xfs, Dave Jones, Linux Kernel

On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> > fat sucker normally comes from.  That way we might get away with much
> > smaller structure on stack...
> 
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.
> 
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd. I didn't look at what the lifetimes were.

Only signalfd, AFAICS.  And there we'd want to use the same small structure -
it's used in
        do {
                ret = signalfd_dequeue(ctx, &info, nonblock);
                if (unlikely(ret <= 0))
                        break;
                ret = signalfd_copyinfo(siginfo, &info);
                if (ret < 0)
                        break;
                siginfo++;
                total += ret;
                nonblock = 1;
        } while (--count);
and using a smaller struct would actually speed the things up - skips one
copying.  sigqueue would be freed as soon as we'd done signalfd_copyinfo()
(if not by signalfd_copyinfo() itself).

I'll try to put something along those lines together, if you or Oleg don't
do it first.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 21:32                           ` Linus Torvalds
@ 2014-02-13 17:40                             ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-13 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On 02/12, Linus Torvalds wrote:
>
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> > fat sucker normally comes from.  That way we might get away with much
> > smaller structure on stack...
>
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.

Probably... I'll try to reply tomorrow.

> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing.

And we should be careful with SIGQUEUE_PREALLOC, at least
collect_signal() should not do list_del_init()... Plus we need to
handle the SEND_SIG_FORCED-like case.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-13 17:40                             ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-13 17:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric Sandeen, Linux Kernel, xfs, Al Viro, Dave Jones

On 02/12, Linus Torvalds wrote:
>
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> > fat sucker normally comes from.  That way we might get away with much
> > smaller structure on stack...
>
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.

Probably... I'll try to reply tomorrow.

> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing.

And we should be careful with SIGQUEUE_PREALLOC, at least
collect_signal() should not do list_del_init()... Plus we need to
handle the SEND_SIG_FORCED-like case.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-13 17:40                             ` Oleg Nesterov
@ 2014-02-13 17:58                               ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-13 17:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> And we should be careful with SIGQUEUE_PREALLOC, at least
> collect_signal() should not do list_del_init()... Plus we need to
> handle the SEND_SIG_FORCED-like case.

I don't think the users need to care. They'd just call
"sigqueue_free()" not knowing about our preallocations etc. That kind
of detail should be confined to inside signal.c.

But there really aren't that many users. There's a couple of
"dequeue_signal_lock()" users, but they don't actually *want* the
siginfo at all (they're kernel threads), so we can just make that
function free the siginfo immediately (and get rid of the totally
unnecessay kernel stack allocation). And outside of signal.c only
signalfd uses "dequeue_signal()" itself, and that would be the only
one that would need to be taught to use (in signalfd_copyinfo()) and
then free the sigqueue entry.

So it really looks like the right thing to do, and fairly
straightforward. But I'm leaving the coding proof to Al, since he
already offered ;)

             Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-13 17:58                               ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-13 17:58 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Al Viro, Dave Jones

On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> And we should be careful with SIGQUEUE_PREALLOC, at least
> collect_signal() should not do list_del_init()... Plus we need to
> handle the SEND_SIG_FORCED-like case.

I don't think the users need to care. They'd just call
"sigqueue_free()" not knowing about our preallocations etc. That kind
of detail should be confined to inside signal.c.

But there really aren't that many users. There's a couple of
"dequeue_signal_lock()" users, but they don't actually *want* the
siginfo at all (they're kernel threads), so we can just make that
function free the siginfo immediately (and get rid of the totally
unnecessay kernel stack allocation). And outside of signal.c only
signalfd uses "dequeue_signal()" itself, and that would be the only
one that would need to be taught to use (in signalfd_copyinfo()) and
then free the sigqueue entry.

So it really looks like the right thing to do, and fairly
straightforward. But I'm leaving the coding proof to Al, since he
already offered ;)

             Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-13 17:58                               ` Linus Torvalds
@ 2014-02-13 18:10                                 ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-13 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On 02/13, Linus Torvalds wrote:
>
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
>
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc.

Yes, but we need to be careful to avoid the races with
release_posix_timer().

> That kind
> of detail should be confined to inside signal.c.

Yes, sure.

> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation).

Yes. Perhaps this helper should be changed/renamed. And perhaps we
can even change __send_signal() to avoid __sigqueue_alloc() if
PF_KTHREAD. Or sig == SIGKILL.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-13 18:10                                 ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-13 18:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric Sandeen, Linux Kernel, xfs, Al Viro, Dave Jones

On 02/13, Linus Torvalds wrote:
>
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
>
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc.

Yes, but we need to be careful to avoid the races with
release_posix_timer().

> That kind
> of detail should be confined to inside signal.c.

Yes, sure.

> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation).

Yes. Perhaps this helper should be changed/renamed. And perhaps we
can even change __send_signal() to avoid __sigqueue_alloc() if
PF_KTHREAD. Or sig == SIGKILL.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-13 18:10                                 ` Oleg Nesterov
@ 2014-02-13 18:37                                   ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-13 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On 02/13, Oleg Nesterov wrote:
>
> On 02/13, Linus Torvalds wrote:
> >
> > On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > And we should be careful with SIGQUEUE_PREALLOC, at least
> > > collect_signal() should not do list_del_init()... Plus we need to
> > > handle the SEND_SIG_FORCED-like case.
> >
> > I don't think the users need to care. They'd just call
> > "sigqueue_free()" not knowing about our preallocations etc.
>
> Yes, but we need to be careful to avoid the races with
> release_posix_timer().

Plus we need to delay do_schedule_next_timer() as well. But this
is probably good because we can avoid unlock/lock(siglock) in
dequeue_signal().

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-13 18:37                                   ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-13 18:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric Sandeen, Linux Kernel, xfs, Al Viro, Dave Jones

On 02/13, Oleg Nesterov wrote:
>
> On 02/13, Linus Torvalds wrote:
> >
> > On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > And we should be careful with SIGQUEUE_PREALLOC, at least
> > > collect_signal() should not do list_del_init()... Plus we need to
> > > handle the SEND_SIG_FORCED-like case.
> >
> > I don't think the users need to care. They'd just call
> > "sigqueue_free()" not knowing about our preallocations etc.
>
> Yes, but we need to be careful to avoid the races with
> release_posix_timer().

Plus we need to delay do_schedule_next_timer() as well. But this
is probably good because we can avoid unlock/lock(siglock) in
dequeue_signal().

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 21:44                             ` Al Viro
@ 2014-02-13 20:51                               ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-13 20:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:

> I'll try to put something along those lines together, if you or Oleg don't
> do it first.

OK, having looked at that stuff...

1) things become much more compact if we finish conversion to get_signal()
first.  Callers of get_signal_to_deliver() have k_sigaction and siginfo in
pair of local variables; switching to ksignal will be neutral wrt stack
footprint (it just gathers those two in one struct) *and* we are getting
rid of passing struct siginfo * around.  With that done, we can change
struct ksignal ->info with zero impact on the code in arch/*, and conversion
makes sense on its own.  In the mainline we have it done for alpha, arm,
openrisc, sparc and x86.  I've just put together preliminary (and completely
untested) patches for arm64, m68k and um; doing the rest won't take long, but
they'll obviously need to be tested.  It's a fairly safe conversion;
I'd expect the worst bugs to be typos.

2) struct small_siginfo ought to go into linux/signal.h.  Contains signal
number, si_code (usually 0), uid, pid and (often NULL) pointer to
struct sigqueue.  Overall: 20 or 24 bytes.  In struct coredump_params we
should replace ->siginfo with pointer to struct small_siginfo.  Ditto for
task_struct ->last_siginfo.  In struct ksignal ->info becomes
struct small_siginfo.

copy_siginfo_to_user{,32}() gets switched to struct small_siginfo *, so does
do_coredump(), ptrace_signal(), ptrace_stop(), ptrace_getsiginfo(),
ptrace_setsiginfo(), collect_signal(), __dequeue_signal(), dequeue_signal(),
signalfd_dequeue() and do_sigtimedwait().

New primitive: assign_sigqueue(small_info, sigqueue).  Frees
small_info->q and assigns a new value to it.

When ptrace_setsiginfo() is given a non-plain siginfo (si_code != SI_USER,
that is), it should a allocate struct sigqueue, copy the sucker there and
give it to assign_sigqueue().  Plain ones just get uid/pid/signo copied
and NULL passed to assign_sigqueue().

code in ptrace_signal() under if (signr != info->si_signo) should
start with assign_sigqueue(info, NULL).

specific_send_sig_info() in ptrace_signal() is rather clumsy to handle; not
sure what's the best way to deal with that.  In any case, ptrace_signal()
deciding to return 0 should take care of info->q - either copy and free the
original, or actually try to reuse the sucker.  Either way, info->q should
become NULL.

get_signal_to_deliver() should do assign_sigqueue(info, NULL) before
the call of do_group_exit().

arch_ptrace_stop_needed() can be left as-is; it's a macro and none of the
instances look at the info argument at all.  The same goes for
arch_ptrace_stop().

signalfd_read() should do assign_sigqueue(&info, NULL) right after
signalfd_copyinfo().

rt_sigtimedwait(2) and its compat variant should do the same right after
copy_siginfo_to_user{,32}().

TP'ed stuff is a mess, as usual.  AFAICS, TP_STORE_SIGINFO() needs to
be split into a variant taking siginfo (on trace_signal_generate side)
and one taking small_siginfo (for trace_signal_deliver).

get_signal() should do assign_sigqueue(&ksig->info, NULL) if it returns
false and use task_work_add() to schedule info->q for freeing otherwise.


Overall, aside of the need to complete arch/* conversion, the only unfinished
part is this:
        /* If the (new) signal is now blocked, requeue it.  */
        if (sigismember(&current->blocked, signr)) {
                specific_send_sig_info(signr, info, current);
                signr = 0;
        }
in the end of ptrace_signal().  Sure, we can add a local struct siginfo
in that if, fill it if needed and pass its address or &info->q->info
to specific_send_sig_info(), but that feels kludgy...  Hell knows, I'll
look a bit more into that.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-13 20:51                               ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-13 20:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric Sandeen, Oleg Nesterov, xfs, Dave Jones, Linux Kernel

On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:

> I'll try to put something along those lines together, if you or Oleg don't
> do it first.

OK, having looked at that stuff...

1) things become much more compact if we finish conversion to get_signal()
first.  Callers of get_signal_to_deliver() have k_sigaction and siginfo in
pair of local variables; switching to ksignal will be neutral wrt stack
footprint (it just gathers those two in one struct) *and* we are getting
rid of passing struct siginfo * around.  With that done, we can change
struct ksignal ->info with zero impact on the code in arch/*, and conversion
makes sense on its own.  In the mainline we have it done for alpha, arm,
openrisc, sparc and x86.  I've just put together preliminary (and completely
untested) patches for arm64, m68k and um; doing the rest won't take long, but
they'll obviously need to be tested.  It's a fairly safe conversion;
I'd expect the worst bugs to be typos.

2) struct small_siginfo ought to go into linux/signal.h.  Contains signal
number, si_code (usually 0), uid, pid and (often NULL) pointer to
struct sigqueue.  Overall: 20 or 24 bytes.  In struct coredump_params we
should replace ->siginfo with pointer to struct small_siginfo.  Ditto for
task_struct ->last_siginfo.  In struct ksignal ->info becomes
struct small_siginfo.

copy_siginfo_to_user{,32}() gets switched to struct small_siginfo *, so does
do_coredump(), ptrace_signal(), ptrace_stop(), ptrace_getsiginfo(),
ptrace_setsiginfo(), collect_signal(), __dequeue_signal(), dequeue_signal(),
signalfd_dequeue() and do_sigtimedwait().

New primitive: assign_sigqueue(small_info, sigqueue).  Frees
small_info->q and assigns a new value to it.

When ptrace_setsiginfo() is given a non-plain siginfo (si_code != SI_USER,
that is), it should a allocate struct sigqueue, copy the sucker there and
give it to assign_sigqueue().  Plain ones just get uid/pid/signo copied
and NULL passed to assign_sigqueue().

code in ptrace_signal() under if (signr != info->si_signo) should
start with assign_sigqueue(info, NULL).

specific_send_sig_info() in ptrace_signal() is rather clumsy to handle; not
sure what's the best way to deal with that.  In any case, ptrace_signal()
deciding to return 0 should take care of info->q - either copy and free the
original, or actually try to reuse the sucker.  Either way, info->q should
become NULL.

get_signal_to_deliver() should do assign_sigqueue(info, NULL) before
the call of do_group_exit().

arch_ptrace_stop_needed() can be left as-is; it's a macro and none of the
instances look at the info argument at all.  The same goes for
arch_ptrace_stop().

signalfd_read() should do assign_sigqueue(&info, NULL) right after
signalfd_copyinfo().

rt_sigtimedwait(2) and its compat variant should do the same right after
copy_siginfo_to_user{,32}().

TP'ed stuff is a mess, as usual.  AFAICS, TP_STORE_SIGINFO() needs to
be split into a variant taking siginfo (on trace_signal_generate side)
and one taking small_siginfo (for trace_signal_deliver).

get_signal() should do assign_sigqueue(&ksig->info, NULL) if it returns
false and use task_work_add() to schedule info->q for freeing otherwise.


Overall, aside of the need to complete arch/* conversion, the only unfinished
part is this:
        /* If the (new) signal is now blocked, requeue it.  */
        if (sigismember(&current->blocked, signr)) {
                specific_send_sig_info(signr, info, current);
                signr = 0;
        }
in the end of ptrace_signal().  Sure, we can add a local struct siginfo
in that if, fill it if needed and pass its address or &info->q->info
to specific_send_sig_info(), but that feels kludgy...  Hell knows, I'll
look a bit more into that.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-13 20:51                               ` Al Viro
@ 2014-02-14  0:09                                 ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14  0:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
> 
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
> 
> OK, having looked at that stuff...
> 
> 1) things become much more compact if we finish conversion to get_signal()
> first.  Callers of get_signal_to_deliver() have k_sigaction and siginfo in
> pair of local variables; switching to ksignal will be neutral wrt stack
> footprint (it just gathers those two in one struct) *and* we are getting
> rid of passing struct siginfo * around.  With that done, we can change
> struct ksignal ->info with zero impact on the code in arch/*, and conversion
> makes sense on its own.  In the mainline we have it done for alpha, arm,
> openrisc, sparc and x86.  I've just put together preliminary (and completely
> untested) patches for arm64, m68k and um; doing the rest won't take long, but
> they'll obviously need to be tested.  It's a fairly safe conversion;
> I'd expect the worst bugs to be typos.

OK, there's a couple of tricks that allow to reorder that.  First of all,
temporary config symbol (ARCH_USES_KSIGNAL) selecting that stuff; if it's
not selected, we just have #define small_siginfo siginfo and
#define assign_sigqueue(info, q) WARN_ON((q) != NULL)
and that's it - most of the changes in core kernel consist of s/siginfo/small_&/
and non-trivial ones are under ifdef CONFIG_ARCH_USES_KSIGNAL for the time
being.  Small ifdefs, at that...

Once all architectures get converted, we'll just kill that config symbol and
make ifdefs unconditional.

Another is that we don't need to bother with task_work_add() at all; on
converted architectures we have signal_setup_done(..., ksig, ...) called after
each successful get_signal() (the first argument of signal_setup_done() is
"has sigframe setup failed" flag; we get there regardless of success or
failure of setup_...frame()).  So we can just have that sucker do
assign_sigqueue(&ksig->info, NULL) and be done with that - all this delayed
freeing is explicit now (and we don't get locking overhead, etc. of
task_work_add()).

If what I've got survives the local beating, I'll put it into signal.git,
throw the arch conversions I could test in there and ask on linux-arch for
help with the missing ones.  With any luck we'll get the full set by the
next merge window, at which point we'll be able to kill ifdefs.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14  0:09                                 ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14  0:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric Sandeen, Oleg Nesterov, xfs, Dave Jones, Linux Kernel

On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
> 
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
> 
> OK, having looked at that stuff...
> 
> 1) things become much more compact if we finish conversion to get_signal()
> first.  Callers of get_signal_to_deliver() have k_sigaction and siginfo in
> pair of local variables; switching to ksignal will be neutral wrt stack
> footprint (it just gathers those two in one struct) *and* we are getting
> rid of passing struct siginfo * around.  With that done, we can change
> struct ksignal ->info with zero impact on the code in arch/*, and conversion
> makes sense on its own.  In the mainline we have it done for alpha, arm,
> openrisc, sparc and x86.  I've just put together preliminary (and completely
> untested) patches for arm64, m68k and um; doing the rest won't take long, but
> they'll obviously need to be tested.  It's a fairly safe conversion;
> I'd expect the worst bugs to be typos.

OK, there's a couple of tricks that allow to reorder that.  First of all,
temporary config symbol (ARCH_USES_KSIGNAL) selecting that stuff; if it's
not selected, we just have #define small_siginfo siginfo and
#define assign_sigqueue(info, q) WARN_ON((q) != NULL)
and that's it - most of the changes in core kernel consist of s/siginfo/small_&/
and non-trivial ones are under ifdef CONFIG_ARCH_USES_KSIGNAL for the time
being.  Small ifdefs, at that...

Once all architectures get converted, we'll just kill that config symbol and
make ifdefs unconditional.

Another is that we don't need to bother with task_work_add() at all; on
converted architectures we have signal_setup_done(..., ksig, ...) called after
each successful get_signal() (the first argument of signal_setup_done() is
"has sigframe setup failed" flag; we get there regardless of success or
failure of setup_...frame()).  So we can just have that sucker do
assign_sigqueue(&ksig->info, NULL) and be done with that - all this delayed
freeing is explicit now (and we don't get locking overhead, etc. of
task_work_add()).

If what I've got survives the local beating, I'll put it into signal.git,
throw the arch conversions I could test in there and ask on linux-arch for
help with the missing ones.  With any luck we'll get the full set by the
next merge window, at which point we'll be able to kill ifdefs.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12  7:18                     ` Dave Chinner
@ 2014-02-14  0:24                       ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-14  0:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 06:18:29PM +1100, Dave Chinner wrote:
> > It looks like just "do_signal()" has a stack frame that is about 230
> > bytes even under normal circumstancs (largely due to "struct ksignal"
> > - which in turn is largely due to the insane 128-byte padding in
> > siginfo_t). Add a few other frames in there, and I guess that if it
> > was close before, the coredump path just makes it go off.
> 
> Yup. But it's when you see this sort of thing you realise that
> almost no GFP_KERNEL memory allocation is really safe from stack
> overruns, though:
> 
>   0)     6064      64   update_group_power+0x2c/0x270
>   1)     6000     384   find_busiest_group+0x10a/0x8b0
>   2)     5616     288   load_balance+0x165/0x870
>   3)     5328      96   idle_balance+0x106/0x1b0
>   4)     5232     112   __schedule+0x7b6/0x7e0
>   5)     5120      16   schedule+0x29/0x70
>   6)     5104     176   percpu_ida_alloc+0x1b3/0x3d0
>   7)     4928      32   blk_mq_wait_for_tags+0x1f/0x40
>   8)     4896      80   blk_mq_alloc_request_pinned+0x4e/0x110
>   9)     4816     128   blk_mq_make_request+0x42b/0x510
>  10)     4688      48   generic_make_request+0xc0/0x110
>  11)     4640      96   submit_bio+0x71/0x120
>  12)     4544     192   _xfs_buf_ioapply+0x2cc/0x3b0
>  13)     4352      48   xfs_buf_iorequest+0x6f/0xc0
>  14)     4304      32   xlog_bdstrat+0x23/0x60
>  15)     4272      96   xlog_sync+0x3a4/0x5c0
>  16)     4176      48   xlog_state_release_iclog+0x121/0x130
>  17)     4128     192   xlog_write+0x700/0x7c0
>  18)     3936     192   xlog_cil_push+0x2ae/0x3d0
>  19)     3744     128   xlog_cil_force_lsn+0x1b8/0x1e0
>  20)     3616     144   _xfs_log_force_lsn+0x7c/0x300
>  21)     3472      48   xfs_log_force_lsn+0x3b/0xa0
>  22)     3424     112   xfs_iunpin_wait+0xd7/0x160
>  23)     3312      80   xfs_reclaim_inode+0xd0/0x350
>  24)     3232     432   xfs_reclaim_inodes_ag+0x277/0x3d0
>  25)     2800      48   xfs_reclaim_inodes_nr+0x33/0x40
>  26)     2752      16   xfs_fs_free_cached_objects+0x15/0x20
>  27)     2736      80   super_cache_scan+0x169/0x170
>  28)     2656     160   shrink_slab_node+0x133/0x290
>  29)     2496      80   shrink_slab+0x122/0x160
>  30)     2416     112   do_try_to_free_pages+0x1de/0x360
>  31)     2304     192   try_to_free_pages+0x110/0x190
>  32)     2112     256   __alloc_pages_nodemask+0x5a2/0x8e0
>  33)     1856      80   alloc_pages_current+0xb2/0x170
>  34)     1776      64   new_slab+0x265/0x2e0
>  35)     1712     240   __slab_alloc+0x2fb/0x4c4
>  36)     1472      80   kmem_cache_alloc+0x10b/0x140
> 
> That's almost 4700 bytes of stack usage from the
> kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
> Yes, it's an extreme case, but if you're looking for a smoking
> gun.... :/
> 
> I can fix this one easily - we already have a workqueue for doing
> async log pushes (will split the stack between xlog_cil_force_lsn
> and xlog_cil_push), but the reason we haven't used it for synchronous
> log forces is that screws up fsync performance on CFQ. We don't
> recommend CFQ with XFS anyway, so I think I'll make this change
> anyway.

Dave, the patch below should chop off the stack usage from
xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
Can you given this a run?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: always do log forces via the workqueue

From: Dave Chinner <dchinner@redhat.com>

Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen at close to the call
chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the
background push wq and waiting for the for the push to complete.
This foreground push was done to avoid confusing the CFQ Io
scheduler when fsync()s were issued, as it has trouble dealing with
dependent IOs being issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since 3.2 kernels were release because of
it's problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work
to the CIL workqueue. We already do the waiting for the push to
complete in xlog_cil_force_lsn(), so there's nothing else we need to
modify to make this work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b57a8e0..7e54553 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -499,13 +499,6 @@ xlog_cil_push(
 	cil->xc_ctx = new_ctx;
 
 	/*
-	 * mirror the new sequence into the cil structure so that we can do
-	 * unlocked checks against the current sequence in log forces without
-	 * risking deferencing a freed context pointer.
-	 */
-	cil->xc_current_sequence = new_ctx->sequence;
-
-	/*
 	 * The switch is now done, so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
@@ -523,8 +516,15 @@ xlog_cil_push(
 	 * Hence we need to add this context to the committing context list so
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
+	 *
+	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * structure atomically with the addition of this sequence to the
+	 * committing list. This also ensures that we can do unlocked checks
+	 * against the current sequence in log forces without risking
+	 * deferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
+	cil->xc_current_sequence = new_ctx->sequence;
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
@@ -662,8 +662,14 @@ xlog_cil_push_background(
 
 }
 
+/*
+ * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
+ * number that is passed. When it returns, the work will be queued for
+ * @push_seq, but it won't be completed. The caller is expected to do any
+ * waiting for push_seq to complete if it is required.
+ */
 static void
-xlog_cil_push_foreground(
+xlog_cil_push_now(
 	struct xlog	*log,
 	xfs_lsn_t	push_seq)
 {
@@ -688,10 +694,8 @@ xlog_cil_push_foreground(
 	}
 
 	cil->xc_push_seq = push_seq;
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
-
-	/* do the push now */
-	xlog_cil_push(log);
 }
 
 bool
@@ -795,7 +799,8 @@ xlog_cil_force_lsn(
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
-	xlog_cil_push_foreground(log, sequence);
+restart:
+	xlog_cil_push_now(log, sequence);
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -803,7 +808,6 @@ xlog_cil_force_lsn(
 	 * before allowing the force of push_seq to go ahead. Hence block
 	 * on commits for those as well.
 	 */
-restart:
 	spin_lock(&cil->xc_push_lock);
 	list_for_each_entry(ctx, &cil->xc_committing, committing) {
 		if (ctx->sequence > sequence)
@@ -821,6 +825,28 @@ restart:
 		/* found it! */
 		commit_lsn = ctx->commit_lsn;
 	}
+
+	/*
+	 * The call to xlog_cil_push_now() executes the push in the background.
+	 * Hence by the time we have got here it our sequence may not have been
+	 * pushed yet. This is true if the current sequence still matches the
+	 * push sequence after the above wait loop and the CIL still contains
+	 * dirty objects.
+	 *
+	 * When the push occurs, it will empty the CIL and
+	 * atomically increment the currect sequence past the push sequence and
+	 * move it into the committing list. Of course, if the CIL is clean at
+	 * the time of the push, it won't have pushed the CIL at all, so in that
+	 * case we should try the push for this sequence again from the start
+	 * just in case.
+	 */
+
+	if (sequence == cil->xc_current_sequence &&
+	    !list_empty(&cil->xc_cil)) {
+		spin_unlock(&cil->xc_push_lock);
+		goto restart;
+	}
+
 	spin_unlock(&cil->xc_push_lock);
 	return commit_lsn;
 }

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14  0:24                       ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-14  0:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, xfs, Eric Sandeen, Al Viro, Linux Kernel

On Wed, Feb 12, 2014 at 06:18:29PM +1100, Dave Chinner wrote:
> > It looks like just "do_signal()" has a stack frame that is about 230
> > bytes even under normal circumstancs (largely due to "struct ksignal"
> > - which in turn is largely due to the insane 128-byte padding in
> > siginfo_t). Add a few other frames in there, and I guess that if it
> > was close before, the coredump path just makes it go off.
> 
> Yup. But it's when you see this sort of thing you realise that
> almost no GFP_KERNEL memory allocation is really safe from stack
> overruns, though:
> 
>   0)     6064      64   update_group_power+0x2c/0x270
>   1)     6000     384   find_busiest_group+0x10a/0x8b0
>   2)     5616     288   load_balance+0x165/0x870
>   3)     5328      96   idle_balance+0x106/0x1b0
>   4)     5232     112   __schedule+0x7b6/0x7e0
>   5)     5120      16   schedule+0x29/0x70
>   6)     5104     176   percpu_ida_alloc+0x1b3/0x3d0
>   7)     4928      32   blk_mq_wait_for_tags+0x1f/0x40
>   8)     4896      80   blk_mq_alloc_request_pinned+0x4e/0x110
>   9)     4816     128   blk_mq_make_request+0x42b/0x510
>  10)     4688      48   generic_make_request+0xc0/0x110
>  11)     4640      96   submit_bio+0x71/0x120
>  12)     4544     192   _xfs_buf_ioapply+0x2cc/0x3b0
>  13)     4352      48   xfs_buf_iorequest+0x6f/0xc0
>  14)     4304      32   xlog_bdstrat+0x23/0x60
>  15)     4272      96   xlog_sync+0x3a4/0x5c0
>  16)     4176      48   xlog_state_release_iclog+0x121/0x130
>  17)     4128     192   xlog_write+0x700/0x7c0
>  18)     3936     192   xlog_cil_push+0x2ae/0x3d0
>  19)     3744     128   xlog_cil_force_lsn+0x1b8/0x1e0
>  20)     3616     144   _xfs_log_force_lsn+0x7c/0x300
>  21)     3472      48   xfs_log_force_lsn+0x3b/0xa0
>  22)     3424     112   xfs_iunpin_wait+0xd7/0x160
>  23)     3312      80   xfs_reclaim_inode+0xd0/0x350
>  24)     3232     432   xfs_reclaim_inodes_ag+0x277/0x3d0
>  25)     2800      48   xfs_reclaim_inodes_nr+0x33/0x40
>  26)     2752      16   xfs_fs_free_cached_objects+0x15/0x20
>  27)     2736      80   super_cache_scan+0x169/0x170
>  28)     2656     160   shrink_slab_node+0x133/0x290
>  29)     2496      80   shrink_slab+0x122/0x160
>  30)     2416     112   do_try_to_free_pages+0x1de/0x360
>  31)     2304     192   try_to_free_pages+0x110/0x190
>  32)     2112     256   __alloc_pages_nodemask+0x5a2/0x8e0
>  33)     1856      80   alloc_pages_current+0xb2/0x170
>  34)     1776      64   new_slab+0x265/0x2e0
>  35)     1712     240   __slab_alloc+0x2fb/0x4c4
>  36)     1472      80   kmem_cache_alloc+0x10b/0x140
> 
> That's almost 4700 bytes of stack usage from the
> kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
> Yes, it's an extreme case, but if you're looking for a smoking
> gun.... :/
> 
> I can fix this one easily - we already have a workqueue for doing
> async log pushes (will split the stack between xlog_cil_force_lsn
> and xlog_cil_push), but the reason we haven't used it for synchronous
> log forces is that screws up fsync performance on CFQ. We don't
> recommend CFQ with XFS anyway, so I think I'll make this change
> anyway.

Dave, the patch below should chop off the stack usage from
xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
Can you given this a run?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: always do log forces via the workqueue

From: Dave Chinner <dchinner@redhat.com>

Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen at close to the call
chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the
background push wq and waiting for the for the push to complete.
This foreground push was done to avoid confusing the CFQ Io
scheduler when fsync()s were issued, as it has trouble dealing with
dependent IOs being issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since 3.2 kernels were release because of
it's problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work
to the CIL workqueue. We already do the waiting for the push to
complete in xlog_cil_force_lsn(), so there's nothing else we need to
modify to make this work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b57a8e0..7e54553 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -499,13 +499,6 @@ xlog_cil_push(
 	cil->xc_ctx = new_ctx;
 
 	/*
-	 * mirror the new sequence into the cil structure so that we can do
-	 * unlocked checks against the current sequence in log forces without
-	 * risking deferencing a freed context pointer.
-	 */
-	cil->xc_current_sequence = new_ctx->sequence;
-
-	/*
 	 * The switch is now done, so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
@@ -523,8 +516,15 @@ xlog_cil_push(
 	 * Hence we need to add this context to the committing context list so
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
+	 *
+	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * structure atomically with the addition of this sequence to the
+	 * committing list. This also ensures that we can do unlocked checks
+	 * against the current sequence in log forces without risking
+	 * deferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
+	cil->xc_current_sequence = new_ctx->sequence;
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
@@ -662,8 +662,14 @@ xlog_cil_push_background(
 
 }
 
+/*
+ * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
+ * number that is passed. When it returns, the work will be queued for
+ * @push_seq, but it won't be completed. The caller is expected to do any
+ * waiting for push_seq to complete if it is required.
+ */
 static void
-xlog_cil_push_foreground(
+xlog_cil_push_now(
 	struct xlog	*log,
 	xfs_lsn_t	push_seq)
 {
@@ -688,10 +694,8 @@ xlog_cil_push_foreground(
 	}
 
 	cil->xc_push_seq = push_seq;
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
-
-	/* do the push now */
-	xlog_cil_push(log);
 }
 
 bool
@@ -795,7 +799,8 @@ xlog_cil_force_lsn(
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
-	xlog_cil_push_foreground(log, sequence);
+restart:
+	xlog_cil_push_now(log, sequence);
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -803,7 +808,6 @@ xlog_cil_force_lsn(
 	 * before allowing the force of push_seq to go ahead. Hence block
 	 * on commits for those as well.
 	 */
-restart:
 	spin_lock(&cil->xc_push_lock);
 	list_for_each_entry(ctx, &cil->xc_committing, committing) {
 		if (ctx->sequence > sequence)
@@ -821,6 +825,28 @@ restart:
 		/* found it! */
 		commit_lsn = ctx->commit_lsn;
 	}
+
+	/*
+	 * The call to xlog_cil_push_now() executes the push in the background.
+	 * Hence by the time we have got here it our sequence may not have been
+	 * pushed yet. This is true if the current sequence still matches the
+	 * push sequence after the above wait loop and the CIL still contains
+	 * dirty objects.
+	 *
+	 * When the push occurs, it will empty the CIL and
+	 * atomically increment the currect sequence past the push sequence and
+	 * move it into the committing list. Of course, if the CIL is clean at
+	 * the time of the push, it won't have pushed the CIL at all, so in that
+	 * case we should try the push for this sequence again from the start
+	 * just in case.
+	 */
+
+	if (sequence == cil->xc_current_sequence &&
+	    !list_empty(&cil->xc_cil)) {
+		spin_unlock(&cil->xc_push_lock);
+		goto restart;
+	}
+
 	spin_unlock(&cil->xc_push_lock);
 	return commit_lsn;
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-13 20:51                               ` Al Viro
@ 2014-02-14 13:25                                 ` Christoph Hellwig
  -1 siblings, 0 replies; 132+ messages in thread
From: Christoph Hellwig @ 2014-02-14 13:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Oleg Nesterov, Dave Chinner, Dave Jones,
	Eric Sandeen, Linux Kernel, xfs, Richard Weinberger

On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
> 
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
> 
> OK, having looked at that stuff...
> 
> 1) things become much more compact if we finish conversion to get_signal()
> first.

I have vague memories that Richard sent out a series to convert over all
architectures a while ago.  Hopefully he has better memory than I do.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 13:25                                 ` Christoph Hellwig
  0 siblings, 0 replies; 132+ messages in thread
From: Christoph Hellwig @ 2014-02-14 13:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Sandeen, Oleg Nesterov, Linux Kernel, Richard Weinberger,
	Dave Jones, xfs, Linus Torvalds

On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
> 
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
> 
> OK, having looked at that stuff...
> 
> 1) things become much more compact if we finish conversion to get_signal()
> first.

I have vague memories that Richard sent out a series to convert over all
architectures a while ago.  Hopefully he has better memory than I do.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 13:25                                 ` Christoph Hellwig
@ 2014-02-14 13:29                                   ` Richard Weinberger
  -1 siblings, 0 replies; 132+ messages in thread
From: Richard Weinberger @ 2014-02-14 13:29 UTC (permalink / raw)
  To: Christoph Hellwig, Al Viro
  Cc: Linus Torvalds, Oleg Nesterov, Dave Chinner, Dave Jones,
	Eric Sandeen, Linux Kernel, xfs

Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
>> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
>>
>>> I'll try to put something along those lines together, if you or Oleg don't
>>> do it first.
>>
>> OK, having looked at that stuff...
>>
>> 1) things become much more compact if we finish conversion to get_signal()
>> first.
> 
> I have vague memories that Richard sent out a series to convert over all
> architectures a while ago.  Hopefully he has better memory than I do.

Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
I think this is a good reason for me to start working on that series again.
Stay tuned.

Thanks,
//richard


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 13:29                                   ` Richard Weinberger
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Weinberger @ 2014-02-14 13:29 UTC (permalink / raw)
  To: Christoph Hellwig, Al Viro
  Cc: Eric Sandeen, Oleg Nesterov, Linux Kernel, Dave Jones, xfs,
	Linus Torvalds

Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
>> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
>>
>>> I'll try to put something along those lines together, if you or Oleg don't
>>> do it first.
>>
>> OK, having looked at that stuff...
>>
>> 1) things become much more compact if we finish conversion to get_signal()
>> first.
> 
> I have vague memories that Richard sent out a series to convert over all
> architectures a while ago.  Hopefully he has better memory than I do.

Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
I think this is a good reason for me to start working on that series again.
Stay tuned.

Thanks,
//richard

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 13:29                                   ` Richard Weinberger
@ 2014-02-14 15:20                                     ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14 15:20 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Christoph Hellwig, Linus Torvalds, Oleg Nesterov, Dave Chinner,
	Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Fri, Feb 14, 2014 at 02:29:30PM +0100, Richard Weinberger wrote:
> Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> > On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> >> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
> >>
> >>> I'll try to put something along those lines together, if you or Oleg don't
> >>> do it first.
> >>
> >> OK, having looked at that stuff...
> >>
> >> 1) things become much more compact if we finish conversion to get_signal()
> >> first.
> > 
> > I have vague memories that Richard sent out a series to convert over all
> > architectures a while ago.  Hopefully he has better memory than I do.
> 
> Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
> I think this is a good reason for me to start working on that series again.
> Stay tuned.

Would be great.   I have several done here, but I'll be glad to replace them
with something tested...

BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
separately?  IOW, why do we want the upper 16 bits of ->si_code exposed?
It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 15:20                                     ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14 15:20 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Eric Sandeen, Oleg Nesterov, Linux Kernel, Christoph Hellwig,
	Dave Jones, xfs, Linus Torvalds

On Fri, Feb 14, 2014 at 02:29:30PM +0100, Richard Weinberger wrote:
> Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> > On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> >> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
> >>
> >>> I'll try to put something along those lines together, if you or Oleg don't
> >>> do it first.
> >>
> >> OK, having looked at that stuff...
> >>
> >> 1) things become much more compact if we finish conversion to get_signal()
> >> first.
> > 
> > I have vague memories that Richard sent out a series to convert over all
> > architectures a while ago.  Hopefully he has better memory than I do.
> 
> Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
> I think this is a good reason for me to start working on that series again.
> Stay tuned.

Would be great.   I have several done here, but I'll be glad to replace them
with something tested...

BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
separately?  IOW, why do we want the upper 16 bits of ->si_code exposed?
It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14  0:24                       ` Dave Chinner
@ 2014-02-14 16:01                         ` Dave Jones
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-14 16:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, Al Viro, Eric Sandeen, Linux Kernel, xfs

On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:
 
 > > I can fix this one easily - we already have a workqueue for doing
 > > async log pushes (will split the stack between xlog_cil_force_lsn
 > > and xlog_cil_push), but the reason we haven't used it for synchronous
 > > log forces is that screws up fsync performance on CFQ. We don't
 > > recommend CFQ with XFS anyway, so I think I'll make this change
 > > anyway.
 > 
 > Dave, the patch below should chop off the stack usage from
 > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
 > Can you given this a run?

Looks like it's survived an overnight run..

	Dave

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 16:01                         ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-14 16:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs, Linus Torvalds, Al Viro, Linux Kernel

On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:
 
 > > I can fix this one easily - we already have a workqueue for doing
 > > async log pushes (will split the stack between xlog_cil_force_lsn
 > > and xlog_cil_push), but the reason we haven't used it for synchronous
 > > log forces is that screws up fsync performance on CFQ. We don't
 > > recommend CFQ with XFS anyway, so I think I'll make this change
 > > anyway.
 > 
 > Dave, the patch below should chop off the stack usage from
 > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
 > Can you given this a run?

Looks like it's survived an overnight run..

	Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 15:20                                     ` Al Viro
@ 2014-02-14 16:08                                       ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-14 16:08 UTC (permalink / raw)
  To: Al Viro, Andrey Vagin
  Cc: Richard Weinberger, Christoph Hellwig, Linus Torvalds,
	Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On 02/14, Al Viro wrote:
>
> BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
> separately?  IOW, why do we want the upper 16 bits of ->si_code exposed?
> It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)

Yes, but checkpoint/restart tools want to dump/restore the internal part
of ->si_code too.

See also the "task_pid_vnr(current) == pid" hack in do_rt_sigqueueinfo(),
this allows to queue a SI_FROMKERNEL() siginfo.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 16:08                                       ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-14 16:08 UTC (permalink / raw)
  To: Al Viro, Andrey Vagin
  Cc: Eric Sandeen, Richard Weinberger, Linux Kernel, xfs,
	Christoph Hellwig, Dave Jones, Linus Torvalds

On 02/14, Al Viro wrote:
>
> BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
> separately?  IOW, why do we want the upper 16 bits of ->si_code exposed?
> It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)

Yes, but checkpoint/restart tools want to dump/restore the internal part
of ->si_code too.

See also the "task_pid_vnr(current) == pid" hack in do_rt_sigqueueinfo(),
this allows to queue a SI_FROMKERNEL() siginfo.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-12 21:32                           ` Linus Torvalds
@ 2014-02-14 16:13                             ` Christoph Hellwig
  -1 siblings, 0 replies; 132+ messages in thread
From: Christoph Hellwig @ 2014-02-14 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Oleg Nesterov, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd.

The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
All of these have in common that they try to handle signals in a kernel
thread (which we don't even allow by default), and that they ignore the
siginfo.  I think they could mostly be replaced by an addition to the
kthread API to allow a kthread to be killed by signals for legacy
reasons.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 16:13                             ` Christoph Hellwig
  0 siblings, 0 replies; 132+ messages in thread
From: Christoph Hellwig @ 2014-02-14 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Sandeen, Linux Kernel, xfs, Al Viro, Oleg Nesterov, Dave Jones

On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd.

The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
All of these have in common that they try to handle signals in a kernel
thread (which we don't even allow by default), and that they ignore the
siginfo.  I think they could mostly be replaced by an addition to the
kthread API to allow a kthread to be killed by signals for legacy
reasons.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 16:13                             ` Christoph Hellwig
@ 2014-02-14 16:16                               ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14 16:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Oleg Nesterov, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Fri, Feb 14, 2014 at 08:13:02AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> > We'd have to teach each user of "dequeue_signal()" to free the siginfo
> > thing. Which shouldn't be too bad - I think we've collected all of
> > that into generic code, and there isn't the mass or architecture code
> > that knows about these things any more. But there are a few odd
> > drivers etc and signalfd.
> 
> The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
> All of these have in common that they try to handle signals in a kernel
> thread (which we don't even allow by default), and that they ignore the
> siginfo.  I think they could mostly be replaced by an addition to the
> kthread API to allow a kthread to be killed by signals for legacy
> reasons.

FWIW, there's a funny situation - all users of dequeue_signal_lock()
actually ignore info completely.  I'm not saying that we ought to
stop returning it, but e.g. jbd part of that patch is simply
diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b60ce1..aefdff2 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c)
                /* Put_super will send a SIGKILL and then wait on the sem.
                 */
                while (signal_pending(current) || freezing(current)) {
-                       siginfo_t info;
+                       ksiginfo_t info;
                        unsigned long signr;
 
                        if (try_to_freeze())
                                goto again;
 
                        signr = dequeue_signal_lock(current, &current->blocked, &info);
+                       dismiss_siginfo(&info);
 
                        switch(signr) {
                        case SIGSTOP:

Not complicated at all.  Where it does get complicated is ->last_siginfo and
PTRACE_SETSIGINFO - getting that reasonably clean is what I'm still fighting
right now...

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 16:16                               ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14 16:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Kernel, Eric Sandeen, Oleg Nesterov, xfs, Dave Jones,
	Linus Torvalds

On Fri, Feb 14, 2014 at 08:13:02AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> > We'd have to teach each user of "dequeue_signal()" to free the siginfo
> > thing. Which shouldn't be too bad - I think we've collected all of
> > that into generic code, and there isn't the mass or architecture code
> > that knows about these things any more. But there are a few odd
> > drivers etc and signalfd.
> 
> The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
> All of these have in common that they try to handle signals in a kernel
> thread (which we don't even allow by default), and that they ignore the
> siginfo.  I think they could mostly be replaced by an addition to the
> kthread API to allow a kthread to be killed by signals for legacy
> reasons.

FWIW, there's a funny situation - all users of dequeue_signal_lock()
actually ignore info completely.  I'm not saying that we ought to
stop returning it, but e.g. jbd part of that patch is simply
diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b60ce1..aefdff2 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c)
                /* Put_super will send a SIGKILL and then wait on the sem.
                 */
                while (signal_pending(current) || freezing(current)) {
-                       siginfo_t info;
+                       ksiginfo_t info;
                        unsigned long signr;
 
                        if (try_to_freeze())
                                goto again;
 
                        signr = dequeue_signal_lock(current, &current->blocked, &info);
+                       dismiss_siginfo(&info);
 
                        switch(signr) {
                        case SIGSTOP:

Not complicated at all.  Where it does get complicated is ->last_siginfo and
PTRACE_SETSIGINFO - getting that reasonably clean is what I'm still fighting
right now...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 16:16                               ` Al Viro
@ 2014-02-14 16:18                                 ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14 16:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Oleg Nesterov, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Fri, Feb 14, 2014 at 04:16:24PM +0000, Al Viro wrote:
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely.  I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

s/jbd/jffs2/, obviously.  Sorry...  And yes, nbd and usb_storage are the
same story.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 16:18                                 ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-14 16:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Kernel, Eric Sandeen, Oleg Nesterov, xfs, Dave Jones,
	Linus Torvalds

On Fri, Feb 14, 2014 at 04:16:24PM +0000, Al Viro wrote:
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely.  I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

s/jbd/jffs2/, obviously.  Sorry...  And yes, nbd and usb_storage are the
same story.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 16:16                               ` Al Viro
@ 2014-02-14 16:19                                 ` Christoph Hellwig
  -1 siblings, 0 replies; 132+ messages in thread
From: Christoph Hellwig @ 2014-02-14 16:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, Oleg Nesterov, Dave Jones,
	Eric Sandeen, Linux Kernel, xfs

On Fri, Feb 14, 2014 at 04:16:24PM +0000, Al Viro wrote:
> > All of these have in common that they try to handle signals in a kernel
> > thread (which we don't even allow by default), and that they ignore the
> > siginfo.  I think they could mostly be replaced by an addition to the
> > kthread API to allow a kthread to be killed by signals for legacy
> > reasons.
> 
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely.  I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

Might aswell stick the discmiss into what was dequeue_signal_lock().
Which at that point should get a saner name (maybe thread_dequeue_signal ?)
and lose all argument except maybe task_struct - not that it's
nessecary, but it would mirror the other functions usually used around
it.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-14 16:19                                 ` Christoph Hellwig
  0 siblings, 0 replies; 132+ messages in thread
From: Christoph Hellwig @ 2014-02-14 16:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Sandeen, Oleg Nesterov, Linux Kernel, Christoph Hellwig,
	Dave Jones, xfs, Linus Torvalds

On Fri, Feb 14, 2014 at 04:16:24PM +0000, Al Viro wrote:
> > All of these have in common that they try to handle signals in a kernel
> > thread (which we don't even allow by default), and that they ignore the
> > siginfo.  I think they could mostly be replaced by an addition to the
> > kthread API to allow a kthread to be killed by signals for legacy
> > reasons.
> 
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely.  I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

Might aswell stick the discmiss into what was dequeue_signal_lock().
Which at that point should get a saner name (maybe thread_dequeue_signal ?)
and lose all argument except maybe task_struct - not that it's
nessecary, but it would mirror the other functions usually used around
it.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-13 17:58                               ` Linus Torvalds
@ 2014-02-15  5:25                                 ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15  5:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Dave Chinner, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Thu, Feb 13, 2014 at 09:58:47AM -0800, Linus Torvalds wrote:
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
> 
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc. That kind
> of detail should be confined to inside signal.c.
> 
> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation). And outside of signal.c only
> signalfd uses "dequeue_signal()" itself, and that would be the only
> one that would need to be taught to use (in signalfd_copyinfo()) and
> then free the sigqueue entry.
> 
> So it really looks like the right thing to do, and fairly
> straightforward. But I'm leaving the coding proof to Al, since he
> already offered ;)

OK, _very_ preliminary patch follows.  It's uglier than it has to and it's
not in the form I would consider suitable for merge.  It doesn't depend
on conversions of architectures to use of ksignal as prereqs; instead of
that it has config symbol (ARCH_USES_KSIGNAL) that can be selected on
architectures already through that conversion (in this patch - just on
x86, since that was all I could test on at the moment; e.g. alpha and
arm could also get such selects right now).  If it isn't selected, everything
builds as usual.  With large siginfo on stack for signal path, as before.
We get ksiginfo_t as an alias for siginfo_t in that case, dismiss_siginfo()
is a no-op, etc.  If it _is_ selected, ksiginfo_t is much smaller than
siginfo_t.

Of course, we pay for that with ifdefs in places that need to understand
[k]siginfo_t guts; signalfd_copyinfo(), copy_siginfo_from_user{,32}() and
copy_ksiginfo_to_user{,32}(), tracepoint for deliver_signal(), collect_signal(),
trivial ones in dequeue_signal() and an ugly bugger in ptrace_signal().
No way around that until all architectures are converted.  Other places
get away with some siginfo_t replaced with ksiginfo_t and dismiss_siginfo()
added in some places.  arch/x86/ia32/ia32_signal.c needed a bit of change
(we probably ought to unify copy_siginfo_from_user32() on all architectures
that have it, actually).

The sucker survived LTP syscall tests and hasn't died on xfstests yet either
(== doesn't seem to leak horribly).  Consider it a proof-of-concept and no
more than that.  One thing I'm really not satisfied with is signal sending
pathway - it would probably make a lot of sense to use ksiginfo there as
well, and have insertion into queue simply steal info->q instead of allocating
and copying.  Another is that ifdefs like that are tolerable only if we
plan to convert everything during this cycle; I believe it to be doable,
actually.  Having that sucker go before the conversions would make for
simpler logistics wrt architecture trees, but it may be better to get
all conversions merged first.  Not sure...  Anyway, here is it; comments
are welcome.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/arch/Kconfig b/arch/Kconfig
index 80bbb8c..f753561 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -522,4 +522,7 @@ config OLD_SIGACTION
 config COMPAT_OLD_SIGACTION
 	bool
 
+config ARCH_USES_KSIGNAL
+	bool
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..4ec468d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,6 +127,7 @@ config X86
 	select HAVE_DEBUG_STACKOVERFLOW
 	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
 	select HAVE_CC_STACKPROTECTOR
+	select ARCH_USES_KSIGNAL
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 2206757..611a03e 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,7 +34,7 @@
 #include <asm/sys_ia32.h>
 #include <asm/smap.h>
 
-int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
+int __copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
 {
 	int err = 0;
 	bool ia32 = test_thread_flag(TIF_IA32);
@@ -105,6 +105,25 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
 	return err;
 }
 
+int copy_siginfo_to_user32(compat_siginfo_t __user *to, const ksiginfo_t *from)
+{
+	int err = 0;
+	if (!from->q) {
+		if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t)))
+			return -EFAULT;
+		put_user_try {
+			/* fast case */
+			put_user_ex(from->si_signo, &to->si_signo);
+			put_user_ex(0, &to->si_errno);
+			put_user_ex((short)from->si_code, &to->si_code);
+			put_user_ex(from->si_pid, &to->si_pid);
+			put_user_ex(from->si_uid, &to->si_uid);
+		} put_user_catch(err);
+		return err;
+	}
+	return __copy_siginfo_to_user32(to, &from->q->info);
+}
+
 int copy_siginfo_from_user32(siginfo_t *to, compat_siginfo_t __user *from)
 {
 	int err = 0;
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 55298db..673ab73 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -201,10 +201,11 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 						msg.msg_flags);
 
 		if (signal_pending(current)) {
-			siginfo_t info;
+			ksiginfo_t info;
 			printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
 				task_pid_nr(current), current->comm,
 				dequeue_signal_lock(current, &current->blocked, &info));
+			dismiss_siginfo(&info);
 			result = -EINTR;
 			sock_shutdown(nbd, !send);
 			break;
diff --git a/drivers/usb/gadget/f_mass_storage.c b/drivers/usb/gadget/f_mass_storage.c
index b963939..ecacac5 100644
--- a/drivers/usb/gadget/f_mass_storage.c
+++ b/drivers/usb/gadget/f_mass_storage.c
@@ -2337,7 +2337,7 @@ static void fsg_disable(struct usb_function *f)
 
 static void handle_exception(struct fsg_common *common)
 {
-	siginfo_t		info;
+	ksiginfo_t	info;
 	int			i;
 	struct fsg_buffhd	*bh;
 	enum fsg_state		old_state;
@@ -2351,6 +2351,7 @@ static void handle_exception(struct fsg_common *common)
 	for (;;) {
 		int sig =
 			dequeue_signal_lock(current, &current->blocked, &info);
+		dismiss_siginfo(&info);
 		if (!sig)
 			break;
 		if (sig != SIGUSR1) {
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 67be295..e79a23e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1371,7 +1371,7 @@ static void fill_auxv_note(struct memelfnote *note, struct mm_struct *mm)
 }
 
 static void fill_siginfo_note(struct memelfnote *note, user_siginfo_t *csigdata,
-		const siginfo_t *siginfo)
+		const ksiginfo_t *siginfo)
 {
 	mm_segment_t old_fs = get_fs();
 	set_fs(KERNEL_DS);
@@ -1578,7 +1578,7 @@ static int fill_thread_core_info(struct elf_thread_core_info *t,
 
 static int fill_note_info(struct elfhdr *elf, int phdrs,
 			  struct elf_note_info *info,
-			  const siginfo_t *siginfo, struct pt_regs *regs)
+			  const ksiginfo_t *siginfo, struct pt_regs *regs)
 {
 	struct task_struct *dump_task = current;
 	const struct user_regset_view *view = task_user_regset_view(dump_task);
@@ -1827,7 +1827,7 @@ static int elf_note_info_init(struct elf_note_info *info)
 
 static int fill_note_info(struct elfhdr *elf, int phdrs,
 			  struct elf_note_info *info,
-			  const siginfo_t *siginfo, struct pt_regs *regs)
+			  const ksiginfo_t *siginfo, struct pt_regs *regs)
 {
 	struct list_head *t;
 	struct core_thread *ct;
diff --git a/fs/coredump.c b/fs/coredump.c
index e3ad709..44ff98d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -484,7 +484,7 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
-void do_coredump(const siginfo_t *siginfo)
+void do_coredump(const ksiginfo_t *siginfo)
 {
 	struct core_state core_state;
 	struct core_name cn;
diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b60ce1..aefdff2 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c)
 		/* Put_super will send a SIGKILL and then wait on the sem.
 		 */
 		while (signal_pending(current) || freezing(current)) {
-			siginfo_t info;
+			ksiginfo_t info;
 			unsigned long signr;
 
 			if (try_to_freeze())
 				goto again;
 
 			signr = dequeue_signal_lock(current, &current->blocked, &info);
+			dismiss_siginfo(&info);
 
 			switch(signr) {
 			case SIGSTOP:
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 424b7b6..d6f3c47 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -78,8 +78,9 @@ static unsigned int signalfd_poll(struct file *file, poll_table *wait)
  * Copied from copy_siginfo_to_user() in kernel/signal.c
  */
 static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
-			     siginfo_t const *kinfo)
+			     const ksiginfo_t *from)
 {
+	const siginfo_t *kinfo;
 	long err;
 
 	BUILD_BUG_ON(sizeof(struct signalfd_siginfo) != 128);
@@ -89,6 +90,19 @@ static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
 	 */
 	err = __clear_user(uinfo, sizeof(*uinfo));
 
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+	if (!from->q) {
+		err |= __put_user(from->si_signo, &uinfo->ssi_signo);
+		err |= __put_user(0, &uinfo->ssi_errno);
+		err |= __put_user((short) from->si_code, &uinfo->ssi_code);
+		err |= __put_user(from->si_pid, &uinfo->ssi_pid);
+		err |= __put_user(from->si_uid, &uinfo->ssi_uid);
+		goto out;
+	}
+	kinfo = &from->q->info;
+#else
+	kinfo = from;
+#endif
 	/*
 	 * If you change siginfo_t structure, please be sure
 	 * this code is fixed accordingly.
@@ -152,10 +166,11 @@ static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
 		break;
 	}
 
+out:
 	return err ? -EFAULT: sizeof(*uinfo);
 }
 
-static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, siginfo_t *info,
+static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, ksiginfo_t *info,
 				int nonblock)
 {
 	ssize_t ret;
@@ -207,7 +222,7 @@ static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
 	struct signalfd_siginfo __user *siginfo;
 	int nonblock = file->f_flags & O_NONBLOCK;
 	ssize_t ret, total = 0;
-	siginfo_t info;
+	ksiginfo_t info;
 
 	count /= sizeof(struct signalfd_siginfo);
 	if (!count)
@@ -219,6 +234,7 @@ static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
 		if (unlikely(ret <= 0))
 			break;
 		ret = signalfd_copyinfo(siginfo, &info);
+		dismiss_siginfo(&info);
 		if (ret < 0)
 			break;
 		siginfo++;
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 3d1a3af..43a6895 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -15,6 +15,11 @@
 #define __SI_CODE(T,N)	((T) | ((N) & 0xffff))
 
 struct siginfo;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+#define small_siginfo siginfo
+#else
+struct small_siginfo;
+#endif
 void do_schedule_next_timer(struct siginfo *info);
 
 #ifndef HAVE_ARCH_COPY_SIGINFO
@@ -32,6 +37,8 @@ static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
 
 #endif
 
-extern int copy_siginfo_to_user(struct siginfo __user *to, const struct siginfo *from);
+extern int copy_ksiginfo_from_user(struct small_siginfo *to, const struct siginfo __user *from);
+extern int copy_siginfo_to_user(struct siginfo __user *to, const struct small_siginfo *from);
+extern int __copy_siginfo_to_user(struct siginfo __user *to, const struct siginfo *from);
 
 #endif
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b4a745d..912d8d4 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -56,7 +56,7 @@ struct linux_binprm {
 
 /* Function parameter for binfmt->coredump */
 struct coredump_params {
-	const siginfo_t *siginfo;
+	const ksiginfo_t *siginfo;
 	struct pt_regs *regs;
 	struct file *file;
 	unsigned long limit;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 3f448c6..9ab31c9 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -362,7 +362,9 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
 long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
 		       unsigned long bitmap_size);
 int copy_siginfo_from_user32(siginfo_t *to, struct compat_siginfo __user *from);
-int copy_siginfo_to_user32(struct compat_siginfo __user *to, const siginfo_t *from);
+int copy_ksiginfo_from_user32(ksiginfo_t *to, struct compat_siginfo __user *from);
+int copy_siginfo_to_user32(struct compat_siginfo __user *to, const ksiginfo_t *from);
+int __copy_siginfo_to_user32(struct compat_siginfo __user *to, const siginfo_t *from);
 int get_compat_sigevent(struct sigevent *event,
 		const struct compat_sigevent __user *u_event);
 long compat_sys_rt_tgsigqueueinfo(compat_pid_t tgid, compat_pid_t pid, int sig,
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index d016a12..3c9f79c 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -15,9 +15,9 @@ extern int dump_skip(struct coredump_params *cprm, size_t nr);
 extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
 extern int dump_align(struct coredump_params *cprm, int align);
 #ifdef CONFIG_COREDUMP
-extern void do_coredump(const siginfo_t *siginfo);
+extern void do_coredump(const ksiginfo_t *siginfo);
 #else
-static inline void do_coredump(const siginfo_t *siginfo) {}
+static inline void do_coredump(const ksiginfo_t *siginfo) {}
 #endif
 
 #endif /* _LINUX_COREDUMP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..60834bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1428,7 +1428,7 @@ struct task_struct {
 	struct io_context *io_context;
 
 	unsigned long ptrace_message;
-	siginfo_t *last_siginfo; /* For ptrace use.  */
+	ksiginfo_t *last_siginfo; /* For ptrace use.  */
 	struct task_io_accounting ioac;
 #if defined(CONFIG_TASK_XACCT)
 	u64 acct_rss_mem1;	/* accumulated rss usage */
@@ -2177,9 +2177,9 @@ extern void flush_signals(struct task_struct *);
 extern void __flush_signals(struct task_struct *);
 extern void ignore_signals(struct task_struct *);
 extern void flush_signal_handlers(struct task_struct *, int force_default);
-extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info);
+extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info);
 
-static inline int dequeue_signal_lock(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+static inline int dequeue_signal_lock(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info)
 {
 	unsigned long flags;
 	int ret;
@@ -2204,7 +2204,6 @@ extern int kill_pid_info_as_cred(int, struct siginfo *, struct pid *,
 				const struct cred *, u32);
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
-extern int kill_proc_info(int, struct siginfo *, pid_t);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 2ac423b..c999b4a 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -236,12 +236,29 @@ static inline int valid_signal(unsigned long sig)
 struct timespec;
 struct pt_regs;
 
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+typedef struct small_siginfo {
+	int si_signo;
+	int si_code;
+	union { struct {	/* yecch... */
+		__kernel_pid_t _pid;
+		__ARCH_SI_UID_T _uid;
+	} _kill; } _sifields;
+	struct sigqueue *q;
+} ksiginfo_t;
+extern void dismiss_siginfo(ksiginfo_t *);
+#else
+#define small_siginfo siginfo
+#define ksiginfo_t siginfo_t
+#define dismiss_siginfo(info) ((void)0)
+#endif
+
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
 extern int do_send_sig_info(int sig, struct siginfo *info,
 				struct task_struct *p, bool group);
 extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p);
 extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
-extern int do_sigtimedwait(const sigset_t *, siginfo_t *,
+extern int do_sigtimedwait(const sigset_t *, ksiginfo_t *,
 				const struct timespec *);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
 extern void set_current_blocked(sigset_t *);
@@ -281,13 +298,13 @@ struct old_sigaction {
 
 struct ksignal {
 	struct k_sigaction ka;
-	siginfo_t info;
+	ksiginfo_t info;
 	int sig;
 };
 
-extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
+extern int get_signal_to_deliver(ksiginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
 extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
-extern void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
+extern void signal_delivered(int sig, ksiginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
 extern void exit_signals(struct task_struct *tsk);
 
 /*
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 1e98b55..c9e7a0e 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -146,7 +146,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
  * Called without locks, shortly before returning to user mode
  * (or handling more signals).
  */
-static inline void tracehook_signal_handler(int sig, siginfo_t *info,
+static inline void tracehook_signal_handler(int sig, ksiginfo_t *info,
 					    const struct k_sigaction *ka,
 					    struct pt_regs *regs, int stepping)
 {
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 39a8a43..1494a3e 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -23,6 +23,27 @@
 		}						\
 	} while (0)
 
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+#define TP_SI_ERRNO(info)	info->si_errno
+#else
+#define TP_SI_ERRNO(info)	info->q ? info->q->info.si_errno : 0
+#endif
+
+#define TP_STORE_KSIGINFO(__entry, info)			\
+	do {							\
+		if (info == (ksiginfo_t *)SEND_SIG_NOINFO ||	\
+		    info == (ksiginfo_t *)SEND_SIG_FORCED) {	\
+			__entry->errno	= 0;			\
+			__entry->code	= SI_USER;		\
+		} else if (info == (ksiginfo_t *)SEND_SIG_PRIV) {\
+			__entry->errno	= 0;			\
+			__entry->code	= SI_KERNEL;		\
+		} else {					\
+			__entry->errno	= TP_SI_ERRNO(info);	\
+			__entry->code	= info->si_code;	\
+		}						\
+	} while (0)
+
 #ifndef TRACE_HEADER_MULTI_READ
 enum {
 	TRACE_SIGNAL_DELIVERED,
@@ -95,7 +116,7 @@ TRACE_EVENT(signal_generate,
  */
 TRACE_EVENT(signal_deliver,
 
-	TP_PROTO(int sig, struct siginfo *info, struct k_sigaction *ka),
+	TP_PROTO(int sig, ksiginfo_t *info, struct k_sigaction *ka),
 
 	TP_ARGS(sig, info, ka),
 
@@ -109,7 +130,7 @@ TRACE_EVENT(signal_deliver,
 
 	TP_fast_assign(
 		__entry->sig	= sig;
-		TP_STORE_SIGINFO(__entry, info);
+		TP_STORE_KSIGINFO(__entry, info);
 		__entry->sa_handler	= (unsigned long)ka->sa.sa_handler;
 		__entry->sa_flags	= ka->sa.sa_flags;
 	),
diff --git a/kernel/compat.c b/kernel/compat.c
index 0a09e48..82084b4 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -579,7 +579,7 @@ COMPAT_SYSCALL_DEFINE5(waitid,
 
 	BUG_ON(info.si_code & __SI_MASK);
 	info.si_code |= __SI_CHLD;
-	return copy_siginfo_to_user32(uinfo, &info);
+	return __copy_siginfo_to_user32(uinfo, &info);
 }
 
 static int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
@@ -981,7 +981,7 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
 	compat_sigset_t s32;
 	sigset_t s;
 	struct timespec t;
-	siginfo_t info;
+	ksiginfo_t info;
 	long ret;
 
 	if (sigsetsize != sizeof(sigset_t))
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1f4bcb3..11223d4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -589,7 +589,7 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
 	return 0;
 }
 
-static int ptrace_getsiginfo(struct task_struct *child, siginfo_t *info)
+static int ptrace_getsiginfo(struct task_struct *child, ksiginfo_t *info)
 {
 	unsigned long flags;
 	int error = -ESRCH;
@@ -605,7 +605,7 @@ static int ptrace_getsiginfo(struct task_struct *child, siginfo_t *info)
 	return error;
 }
 
-static int ptrace_setsiginfo(struct task_struct *child, const siginfo_t *info)
+static int ptrace_setsiginfo(struct task_struct *child, ksiginfo_t *info)
 {
 	unsigned long flags;
 	int error = -ESRCH;
@@ -613,11 +613,14 @@ static int ptrace_setsiginfo(struct task_struct *child, const siginfo_t *info)
 	if (lock_task_sighand(child, &flags)) {
 		error = -EINVAL;
 		if (likely(child->last_siginfo != NULL)) {
+			dismiss_siginfo(child->last_siginfo);
 			*child->last_siginfo = *info;
 			error = 0;
 		}
 		unlock_task_sighand(child, &flags);
 	}
+	if (error)
+		dismiss_siginfo(info);
 	return error;
 }
 
@@ -666,7 +669,7 @@ static int ptrace_peek_siginfo(struct task_struct *child,
 		if (unlikely(is_compat_task())) {
 			compat_siginfo_t __user *uinfo = compat_ptr(data);
 
-			if (copy_siginfo_to_user32(uinfo, &info) ||
+			if (__copy_siginfo_to_user32(uinfo, &info) ||
 			    __put_user(info.si_code, &uinfo->si_code)) {
 				ret = -EFAULT;
 				break;
@@ -677,7 +680,7 @@ static int ptrace_peek_siginfo(struct task_struct *child,
 		{
 			siginfo_t __user *uinfo = (siginfo_t __user *) data;
 
-			if (copy_siginfo_to_user(uinfo, &info) ||
+			if (__copy_siginfo_to_user(uinfo, &info) ||
 			    __put_user(info.si_code, &uinfo->si_code)) {
 				ret = -EFAULT;
 				break;
@@ -805,7 +808,7 @@ int ptrace_request(struct task_struct *child, long request,
 {
 	bool seized = child->ptrace & PT_SEIZED;
 	int ret = -EIO;
-	siginfo_t siginfo, *si;
+	ksiginfo_t siginfo, *si;
 	void __user *datavp = (void __user *) data;
 	unsigned long __user *datalp = datavp;
 	unsigned long flags;
@@ -839,9 +842,8 @@ int ptrace_request(struct task_struct *child, long request,
 		break;
 
 	case PTRACE_SETSIGINFO:
-		if (copy_from_user(&siginfo, datavp, sizeof siginfo))
-			ret = -EFAULT;
-		else
+		ret = copy_ksiginfo_from_user(&siginfo, datavp);
+		if (!ret)
 			ret = ptrace_setsiginfo(child, &siginfo);
 		break;
 
@@ -1107,7 +1109,7 @@ int compat_ptrace_request(struct task_struct *child, compat_long_t request,
 {
 	compat_ulong_t __user *datap = compat_ptr(data);
 	compat_ulong_t word;
-	siginfo_t siginfo;
+	ksiginfo_t siginfo;
 	int ret;
 
 	switch (request) {
@@ -1139,8 +1141,7 @@ int compat_ptrace_request(struct task_struct *child, compat_long_t request,
 		break;
 
 	case PTRACE_SETSIGINFO:
-		memset(&siginfo, 0, sizeof siginfo);
-		if (copy_siginfo_from_user32(
+		if (copy_ksiginfo_from_user32(
 			    &siginfo, (struct compat_siginfo __user *) datap))
 			ret = -EFAULT;
 		else
diff --git a/kernel/signal.c b/kernel/signal.c
index 48cf9fc..2b641aa 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -540,7 +540,18 @@ unblock_all_signals(void)
 	spin_unlock_irqrestore(&current->sighand->siglock, flags);
 }
 
-static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+void dismiss_siginfo(ksiginfo_t *info)
+{
+	if (info->q)
+		__sigqueue_free(info->q);
+	info->q = NULL;
+}
+EXPORT_SYMBOL(dismiss_siginfo);
+#endif
+
+static void collect_signal(int sig, struct sigpending *list,
+			   ksiginfo_t *info)
 {
 	struct sigqueue *q, *first = NULL;
 
@@ -561,8 +572,14 @@ static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
 	if (first) {
 still_pending:
 		list_del_init(&first->list);
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		copy_siginfo(info, &first->info);
 		__sigqueue_free(first);
+#else
+		info->si_signo = sig;
+		info->si_code = first->info.si_code;
+		info->q = first;
+#endif
 	} else {
 		/*
 		 * Ok, it wasn't in the queue.  This must be
@@ -570,15 +587,19 @@ still_pending:
 		 * out of queue space.  So zero out the info.
 		 */
 		info->si_signo = sig;
-		info->si_errno = 0;
 		info->si_code = SI_USER;
 		info->si_pid = 0;
 		info->si_uid = 0;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+		info->si_errno = 0;
+#else
+		info->q = NULL;
+#endif
 	}
 }
 
 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
-			siginfo_t *info)
+			ksiginfo_t *info)
 {
 	int sig = next_signal(pending, mask);
 
@@ -604,7 +625,7 @@ static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
  *
  * All callers have to hold the siglock.
  */
-int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+int dequeue_signal(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info)
 {
 	int signr;
 
@@ -659,7 +680,12 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 		 */
 		current->jobctl |= JOBCTL_STOP_DEQUEUED;
 	}
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 	if ((info->si_code & __SI_MASK) == __SI_TIMER && info->si_sys_private) {
+#else
+	if ((info->si_code & __SI_MASK) == __SI_TIMER &&
+	    info->q->info.si_sys_private) {
+#endif
 		/*
 		 * Release the siglock to ensure proper locking order
 		 * of timer locks outside of siglocks.  Note, we leave
@@ -667,7 +693,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 		 * about to disable them again anyway.
 		 */
 		spin_unlock(&tsk->sighand->siglock);
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		do_schedule_next_timer(info);
+#else
+		do_schedule_next_timer(&info->q->info);
+#endif
 		spin_lock(&tsk->sighand->siglock);
 	}
 	return signr;
@@ -1836,7 +1866,8 @@ static int sigkill_pending(struct task_struct *tsk)
  * If we actually decide not to stop at all because the tracer
  * is gone, we keep current->exit_code unless clear_code.
  */
-static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
+static void ptrace_stop(int exit_code, int why, int clear_code,
+			ksiginfo_t *info)
 	__releases(&current->sighand->siglock)
 	__acquires(&current->sighand->siglock)
 {
@@ -1960,16 +1991,17 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 
 static void ptrace_do_notify(int signr, int exit_code, int why)
 {
-	siginfo_t info;
+	ksiginfo_t info;
 
-	memset(&info, 0, sizeof info);
 	info.si_signo = signr;
 	info.si_code = exit_code;
 	info.si_pid = task_pid_vnr(current);
 	info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+	info.q = NULL;
 
 	/* Let the debugger run.  */
 	ptrace_stop(exit_code, why, 1, &info);
+	dismiss_siginfo(&info);
 }
 
 void ptrace_notify(int exit_code)
@@ -2141,7 +2173,7 @@ static void do_jobctl_trap(void)
 	}
 }
 
-static int ptrace_signal(int signr, siginfo_t *info)
+static int ptrace_signal(int signr, ksiginfo_t *info)
 {
 	ptrace_signal_deliver();
 	/*
@@ -2170,8 +2202,11 @@ static int ptrace_signal(int signr, siginfo_t *info)
 	 * have updated *info via PTRACE_SETSIGINFO.
 	 */
 	if (signr != info->si_signo) {
+		dismiss_siginfo(info);
 		info->si_signo = signr;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		info->si_errno = 0;
+#endif
 		info->si_code = SI_USER;
 		rcu_read_lock();
 		info->si_pid = task_pid_vnr(current->parent);
@@ -2182,14 +2217,25 @@ static int ptrace_signal(int signr, siginfo_t *info)
 
 	/* If the (new) signal is now blocked, requeue it.  */
 	if (sigismember(&current->blocked, signr)) {
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		specific_send_sig_info(signr, info, current);
+#else
+		if (!info->q) {
+			struct siginfo s = {.si_code = info->si_code,
+					    .si_signo = signr,
+					    .si_pid = info->si_pid,
+					    .si_uid = info->si_uid};
+			specific_send_sig_info(signr, &s, current);
+		} else
+			specific_send_sig_info(signr, &info->q->info, current);
+#endif
 		signr = 0;
 	}
 
 	return signr;
 }
 
-int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
+int get_signal_to_deliver(ksiginfo_t *info, struct k_sigaction *return_ka,
 			  struct pt_regs *regs, void *cookie)
 {
 	struct sighand_struct *sighand = current->sighand;
@@ -2363,6 +2409,7 @@ relock:
 			 */
 			do_coredump(info);
 		}
+		dismiss_siginfo(info);
 
 		/*
 		 * Death signals, no core dump.
@@ -2377,7 +2424,7 @@ relock:
 /**
  * signal_delivered - 
  * @sig:		number of signal being delivered
- * @info:		siginfo_t of signal being delivered
+ * @info:		ksiginfo_t of signal being delivered
  * @ka:			sigaction setting that chose the handler
  * @regs:		user register state
  * @stepping:		nonzero if debugger single-step or block-step in use
@@ -2387,7 +2434,7 @@ relock:
  * is always blocked, and the signal itself is blocked unless %SA_NODEFER
  * is set in @ka->sa.sa_flags.  Tracing is notified.
  */
-void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka,
+void signal_delivered(int sig, ksiginfo_t *info, struct k_sigaction *ka,
 			struct pt_regs *regs, int stepping)
 {
 	sigset_t blocked;
@@ -2412,6 +2459,7 @@ void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
 	else
 		signal_delivered(ksig->sig, &ksig->info, &ksig->ka,
 			signal_pt_regs(), stepping);
+	dismiss_siginfo(&ksig->info);
 }
 
 /*
@@ -2721,7 +2769,7 @@ COMPAT_SYSCALL_DEFINE2(rt_sigpending, compat_sigset_t __user *, uset,
 }
 #endif
 
-int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
+int __copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
 {
 	int err;
 
@@ -2804,13 +2852,35 @@ int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
 	return err;
 }
 
+int copy_siginfo_to_user(siginfo_t __user *to, const ksiginfo_t *from)
+{
+	const siginfo_t *info;
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+	if (!from->q) {
+		int err;
+		if (!access_ok (VERIFY_WRITE, to, sizeof(siginfo_t)))
+			return -EFAULT;
+		err = __put_user(from->si_signo, &to->si_signo);
+		err |= __put_user(0, &to->si_errno);
+		err |= __put_user((short) from->si_code, &to->si_code);
+		err |= __put_user(from->si_pid, &to->si_pid);
+		err |= __put_user(from->si_uid, &to->si_uid);
+		return err;
+	}
+	info = &from->q->info;
+#else
+	info = from;
+#endif
+	return __copy_siginfo_to_user(to, info);
+}
+
 /**
  *  do_sigtimedwait - wait for queued signals specified in @which
  *  @which: queued signals to wait for
  *  @info: if non-null, the signal's siginfo is returned here
  *  @ts: upper bound on process time suspension
  */
-int do_sigtimedwait(const sigset_t *which, siginfo_t *info,
+int do_sigtimedwait(const sigset_t *which, ksiginfo_t *info,
 			const struct timespec *ts)
 {
 	struct task_struct *tsk = current;
@@ -2878,7 +2948,7 @@ SYSCALL_DEFINE4(rt_sigtimedwait, const sigset_t __user *, uthese,
 {
 	sigset_t these;
 	struct timespec ts;
-	siginfo_t info;
+	ksiginfo_t info;
 	int ret;
 
 	/* XXX: Don't preclude handling different sized sigset_t's.  */
@@ -2999,6 +3069,73 @@ SYSCALL_DEFINE2(tkill, pid_t, pid, int, sig)
 	return do_tkill(0, pid, sig);
 }
 
+int copy_ksiginfo_from_user(ksiginfo_t *to, const siginfo_t __user *from)
+{
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+	return copy_from_user(to, from, sizeof(siginfo_t)) ? -EFAULT : 0;
+#else
+	struct sigqueue *q;
+	if (!access_ok(VERIFY_READ, from, sizeof(siginfo_t)))
+		return -EFAULT;
+	if (__get_user(to->si_code, &from->si_code))
+		return -EFAULT;
+	if (to->si_code == SI_USER) {
+		if (__get_user(to->si_signo, &from->si_signo) ||
+		    __get_user(to->si_uid, &from->si_uid) ||
+		    __get_user(to->si_pid, &from->si_pid))
+			return -EFAULT;
+		to->q = NULL;
+		return 0;
+	}
+	q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
+	if (!q)
+		return -ENOMEM;
+	if (copy_from_user(&q->info, from, sizeof(siginfo_t))) {
+		__sigqueue_free(q);
+		return -EFAULT;
+	}
+	to->q = q;
+	to->si_code = q->info.si_code;
+	to->si_signo = q->info.si_signo;
+	return 0;
+#endif
+}
+
+#ifdef CONFIG_COMPAT
+
+int copy_ksiginfo_from_user32(ksiginfo_t *to, compat_siginfo_t __user *from)
+{
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+	return copy_siginfo_from_user32(to, from);
+#else
+	struct sigqueue *q;
+	if (!access_ok(VERIFY_READ, from, sizeof(compat_siginfo_t)))
+		return -EFAULT;
+	if (__get_user(to->si_code, &from->si_code))
+		return -EFAULT;
+	if (to->si_code == SI_USER) {
+		if (__get_user(to->si_signo, &from->si_signo) ||
+		    __get_user(to->si_uid, &from->si_uid) ||
+		    __get_user(to->si_pid, &from->si_pid))
+			return -EFAULT;
+		to->q = NULL;
+		return 0;
+	}
+	q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
+	if (!q)
+		return -ENOMEM;
+	if (copy_siginfo_from_user32(&q->info, from)) {
+		__sigqueue_free(q);
+		return -EFAULT;
+	}
+	to->q = q;
+	to->si_code = q->info.si_code;
+	to->si_signo = q->info.si_signo;
+	return 0;
+#endif
+}
+#endif
+
 static int do_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t *info)
 {
 	/* Not even root can pretend to send signals from the kernel.

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15  5:25                                 ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15  5:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric Sandeen, Oleg Nesterov, xfs, Dave Jones, Linux Kernel

On Thu, Feb 13, 2014 at 09:58:47AM -0800, Linus Torvalds wrote:
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
> 
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc. That kind
> of detail should be confined to inside signal.c.
> 
> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation). And outside of signal.c only
> signalfd uses "dequeue_signal()" itself, and that would be the only
> one that would need to be taught to use (in signalfd_copyinfo()) and
> then free the sigqueue entry.
> 
> So it really looks like the right thing to do, and fairly
> straightforward. But I'm leaving the coding proof to Al, since he
> already offered ;)

OK, _very_ preliminary patch follows.  It's uglier than it has to and it's
not in the form I would consider suitable for merge.  It doesn't depend
on conversions of architectures to use of ksignal as prereqs; instead of
that it has config symbol (ARCH_USES_KSIGNAL) that can be selected on
architectures already through that conversion (in this patch - just on
x86, since that was all I could test on at the moment; e.g. alpha and
arm could also get such selects right now).  If it isn't selected, everything
builds as usual.  With large siginfo on stack for signal path, as before.
We get ksiginfo_t as an alias for siginfo_t in that case, dismiss_siginfo()
is a no-op, etc.  If it _is_ selected, ksiginfo_t is much smaller than
siginfo_t.

Of course, we pay for that with ifdefs in places that need to understand
[k]siginfo_t guts; signalfd_copyinfo(), copy_siginfo_from_user{,32}() and
copy_ksiginfo_to_user{,32}(), tracepoint for deliver_signal(), collect_signal(),
trivial ones in dequeue_signal() and an ugly bugger in ptrace_signal().
No way around that until all architectures are converted.  Other places
get away with some siginfo_t replaced with ksiginfo_t and dismiss_siginfo()
added in some places.  arch/x86/ia32/ia32_signal.c needed a bit of change
(we probably ought to unify copy_siginfo_from_user32() on all architectures
that have it, actually).

The sucker survived LTP syscall tests and hasn't died on xfstests yet either
(== doesn't seem to leak horribly).  Consider it a proof-of-concept and no
more than that.  One thing I'm really not satisfied with is signal sending
pathway - it would probably make a lot of sense to use ksiginfo there as
well, and have insertion into queue simply steal info->q instead of allocating
and copying.  Another is that ifdefs like that are tolerable only if we
plan to convert everything during this cycle; I believe it to be doable,
actually.  Having that sucker go before the conversions would make for
simpler logistics wrt architecture trees, but it may be better to get
all conversions merged first.  Not sure...  Anyway, here is it; comments
are welcome.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/arch/Kconfig b/arch/Kconfig
index 80bbb8c..f753561 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -522,4 +522,7 @@ config OLD_SIGACTION
 config COMPAT_OLD_SIGACTION
 	bool
 
+config ARCH_USES_KSIGNAL
+	bool
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..4ec468d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,6 +127,7 @@ config X86
 	select HAVE_DEBUG_STACKOVERFLOW
 	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
 	select HAVE_CC_STACKPROTECTOR
+	select ARCH_USES_KSIGNAL
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 2206757..611a03e 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,7 +34,7 @@
 #include <asm/sys_ia32.h>
 #include <asm/smap.h>
 
-int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
+int __copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
 {
 	int err = 0;
 	bool ia32 = test_thread_flag(TIF_IA32);
@@ -105,6 +105,25 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
 	return err;
 }
 
+int copy_siginfo_to_user32(compat_siginfo_t __user *to, const ksiginfo_t *from)
+{
+	int err = 0;
+	if (!from->q) {
+		if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t)))
+			return -EFAULT;
+		put_user_try {
+			/* fast case */
+			put_user_ex(from->si_signo, &to->si_signo);
+			put_user_ex(0, &to->si_errno);
+			put_user_ex((short)from->si_code, &to->si_code);
+			put_user_ex(from->si_pid, &to->si_pid);
+			put_user_ex(from->si_uid, &to->si_uid);
+		} put_user_catch(err);
+		return err;
+	}
+	return __copy_siginfo_to_user32(to, &from->q->info);
+}
+
 int copy_siginfo_from_user32(siginfo_t *to, compat_siginfo_t __user *from)
 {
 	int err = 0;
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 55298db..673ab73 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -201,10 +201,11 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 						msg.msg_flags);
 
 		if (signal_pending(current)) {
-			siginfo_t info;
+			ksiginfo_t info;
 			printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
 				task_pid_nr(current), current->comm,
 				dequeue_signal_lock(current, &current->blocked, &info));
+			dismiss_siginfo(&info);
 			result = -EINTR;
 			sock_shutdown(nbd, !send);
 			break;
diff --git a/drivers/usb/gadget/f_mass_storage.c b/drivers/usb/gadget/f_mass_storage.c
index b963939..ecacac5 100644
--- a/drivers/usb/gadget/f_mass_storage.c
+++ b/drivers/usb/gadget/f_mass_storage.c
@@ -2337,7 +2337,7 @@ static void fsg_disable(struct usb_function *f)
 
 static void handle_exception(struct fsg_common *common)
 {
-	siginfo_t		info;
+	ksiginfo_t	info;
 	int			i;
 	struct fsg_buffhd	*bh;
 	enum fsg_state		old_state;
@@ -2351,6 +2351,7 @@ static void handle_exception(struct fsg_common *common)
 	for (;;) {
 		int sig =
 			dequeue_signal_lock(current, &current->blocked, &info);
+		dismiss_siginfo(&info);
 		if (!sig)
 			break;
 		if (sig != SIGUSR1) {
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 67be295..e79a23e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1371,7 +1371,7 @@ static void fill_auxv_note(struct memelfnote *note, struct mm_struct *mm)
 }
 
 static void fill_siginfo_note(struct memelfnote *note, user_siginfo_t *csigdata,
-		const siginfo_t *siginfo)
+		const ksiginfo_t *siginfo)
 {
 	mm_segment_t old_fs = get_fs();
 	set_fs(KERNEL_DS);
@@ -1578,7 +1578,7 @@ static int fill_thread_core_info(struct elf_thread_core_info *t,
 
 static int fill_note_info(struct elfhdr *elf, int phdrs,
 			  struct elf_note_info *info,
-			  const siginfo_t *siginfo, struct pt_regs *regs)
+			  const ksiginfo_t *siginfo, struct pt_regs *regs)
 {
 	struct task_struct *dump_task = current;
 	const struct user_regset_view *view = task_user_regset_view(dump_task);
@@ -1827,7 +1827,7 @@ static int elf_note_info_init(struct elf_note_info *info)
 
 static int fill_note_info(struct elfhdr *elf, int phdrs,
 			  struct elf_note_info *info,
-			  const siginfo_t *siginfo, struct pt_regs *regs)
+			  const ksiginfo_t *siginfo, struct pt_regs *regs)
 {
 	struct list_head *t;
 	struct core_thread *ct;
diff --git a/fs/coredump.c b/fs/coredump.c
index e3ad709..44ff98d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -484,7 +484,7 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
-void do_coredump(const siginfo_t *siginfo)
+void do_coredump(const ksiginfo_t *siginfo)
 {
 	struct core_state core_state;
 	struct core_name cn;
diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b60ce1..aefdff2 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c)
 		/* Put_super will send a SIGKILL and then wait on the sem.
 		 */
 		while (signal_pending(current) || freezing(current)) {
-			siginfo_t info;
+			ksiginfo_t info;
 			unsigned long signr;
 
 			if (try_to_freeze())
 				goto again;
 
 			signr = dequeue_signal_lock(current, &current->blocked, &info);
+			dismiss_siginfo(&info);
 
 			switch(signr) {
 			case SIGSTOP:
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 424b7b6..d6f3c47 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -78,8 +78,9 @@ static unsigned int signalfd_poll(struct file *file, poll_table *wait)
  * Copied from copy_siginfo_to_user() in kernel/signal.c
  */
 static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
-			     siginfo_t const *kinfo)
+			     const ksiginfo_t *from)
 {
+	const siginfo_t *kinfo;
 	long err;
 
 	BUILD_BUG_ON(sizeof(struct signalfd_siginfo) != 128);
@@ -89,6 +90,19 @@ static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
 	 */
 	err = __clear_user(uinfo, sizeof(*uinfo));
 
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+	if (!from->q) {
+		err |= __put_user(from->si_signo, &uinfo->ssi_signo);
+		err |= __put_user(0, &uinfo->ssi_errno);
+		err |= __put_user((short) from->si_code, &uinfo->ssi_code);
+		err |= __put_user(from->si_pid, &uinfo->ssi_pid);
+		err |= __put_user(from->si_uid, &uinfo->ssi_uid);
+		goto out;
+	}
+	kinfo = &from->q->info;
+#else
+	kinfo = from;
+#endif
 	/*
 	 * If you change siginfo_t structure, please be sure
 	 * this code is fixed accordingly.
@@ -152,10 +166,11 @@ static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
 		break;
 	}
 
+out:
 	return err ? -EFAULT: sizeof(*uinfo);
 }
 
-static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, siginfo_t *info,
+static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, ksiginfo_t *info,
 				int nonblock)
 {
 	ssize_t ret;
@@ -207,7 +222,7 @@ static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
 	struct signalfd_siginfo __user *siginfo;
 	int nonblock = file->f_flags & O_NONBLOCK;
 	ssize_t ret, total = 0;
-	siginfo_t info;
+	ksiginfo_t info;
 
 	count /= sizeof(struct signalfd_siginfo);
 	if (!count)
@@ -219,6 +234,7 @@ static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
 		if (unlikely(ret <= 0))
 			break;
 		ret = signalfd_copyinfo(siginfo, &info);
+		dismiss_siginfo(&info);
 		if (ret < 0)
 			break;
 		siginfo++;
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 3d1a3af..43a6895 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -15,6 +15,11 @@
 #define __SI_CODE(T,N)	((T) | ((N) & 0xffff))
 
 struct siginfo;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+#define small_siginfo siginfo
+#else
+struct small_siginfo;
+#endif
 void do_schedule_next_timer(struct siginfo *info);
 
 #ifndef HAVE_ARCH_COPY_SIGINFO
@@ -32,6 +37,8 @@ static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
 
 #endif
 
-extern int copy_siginfo_to_user(struct siginfo __user *to, const struct siginfo *from);
+extern int copy_ksiginfo_from_user(struct small_siginfo *to, const struct siginfo __user *from);
+extern int copy_siginfo_to_user(struct siginfo __user *to, const struct small_siginfo *from);
+extern int __copy_siginfo_to_user(struct siginfo __user *to, const struct siginfo *from);
 
 #endif
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b4a745d..912d8d4 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -56,7 +56,7 @@ struct linux_binprm {
 
 /* Function parameter for binfmt->coredump */
 struct coredump_params {
-	const siginfo_t *siginfo;
+	const ksiginfo_t *siginfo;
 	struct pt_regs *regs;
 	struct file *file;
 	unsigned long limit;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 3f448c6..9ab31c9 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -362,7 +362,9 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
 long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
 		       unsigned long bitmap_size);
 int copy_siginfo_from_user32(siginfo_t *to, struct compat_siginfo __user *from);
-int copy_siginfo_to_user32(struct compat_siginfo __user *to, const siginfo_t *from);
+int copy_ksiginfo_from_user32(ksiginfo_t *to, struct compat_siginfo __user *from);
+int copy_siginfo_to_user32(struct compat_siginfo __user *to, const ksiginfo_t *from);
+int __copy_siginfo_to_user32(struct compat_siginfo __user *to, const siginfo_t *from);
 int get_compat_sigevent(struct sigevent *event,
 		const struct compat_sigevent __user *u_event);
 long compat_sys_rt_tgsigqueueinfo(compat_pid_t tgid, compat_pid_t pid, int sig,
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index d016a12..3c9f79c 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -15,9 +15,9 @@ extern int dump_skip(struct coredump_params *cprm, size_t nr);
 extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
 extern int dump_align(struct coredump_params *cprm, int align);
 #ifdef CONFIG_COREDUMP
-extern void do_coredump(const siginfo_t *siginfo);
+extern void do_coredump(const ksiginfo_t *siginfo);
 #else
-static inline void do_coredump(const siginfo_t *siginfo) {}
+static inline void do_coredump(const ksiginfo_t *siginfo) {}
 #endif
 
 #endif /* _LINUX_COREDUMP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..60834bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1428,7 +1428,7 @@ struct task_struct {
 	struct io_context *io_context;
 
 	unsigned long ptrace_message;
-	siginfo_t *last_siginfo; /* For ptrace use.  */
+	ksiginfo_t *last_siginfo; /* For ptrace use.  */
 	struct task_io_accounting ioac;
 #if defined(CONFIG_TASK_XACCT)
 	u64 acct_rss_mem1;	/* accumulated rss usage */
@@ -2177,9 +2177,9 @@ extern void flush_signals(struct task_struct *);
 extern void __flush_signals(struct task_struct *);
 extern void ignore_signals(struct task_struct *);
 extern void flush_signal_handlers(struct task_struct *, int force_default);
-extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info);
+extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info);
 
-static inline int dequeue_signal_lock(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+static inline int dequeue_signal_lock(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info)
 {
 	unsigned long flags;
 	int ret;
@@ -2204,7 +2204,6 @@ extern int kill_pid_info_as_cred(int, struct siginfo *, struct pid *,
 				const struct cred *, u32);
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
-extern int kill_proc_info(int, struct siginfo *, pid_t);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 2ac423b..c999b4a 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -236,12 +236,29 @@ static inline int valid_signal(unsigned long sig)
 struct timespec;
 struct pt_regs;
 
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+typedef struct small_siginfo {
+	int si_signo;
+	int si_code;
+	union { struct {	/* yecch... */
+		__kernel_pid_t _pid;
+		__ARCH_SI_UID_T _uid;
+	} _kill; } _sifields;
+	struct sigqueue *q;
+} ksiginfo_t;
+extern void dismiss_siginfo(ksiginfo_t *);
+#else
+#define small_siginfo siginfo
+#define ksiginfo_t siginfo_t
+#define dismiss_siginfo(info) ((void)0)
+#endif
+
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
 extern int do_send_sig_info(int sig, struct siginfo *info,
 				struct task_struct *p, bool group);
 extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p);
 extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
-extern int do_sigtimedwait(const sigset_t *, siginfo_t *,
+extern int do_sigtimedwait(const sigset_t *, ksiginfo_t *,
 				const struct timespec *);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
 extern void set_current_blocked(sigset_t *);
@@ -281,13 +298,13 @@ struct old_sigaction {
 
 struct ksignal {
 	struct k_sigaction ka;
-	siginfo_t info;
+	ksiginfo_t info;
 	int sig;
 };
 
-extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
+extern int get_signal_to_deliver(ksiginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
 extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
-extern void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
+extern void signal_delivered(int sig, ksiginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
 extern void exit_signals(struct task_struct *tsk);
 
 /*
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 1e98b55..c9e7a0e 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -146,7 +146,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
  * Called without locks, shortly before returning to user mode
  * (or handling more signals).
  */
-static inline void tracehook_signal_handler(int sig, siginfo_t *info,
+static inline void tracehook_signal_handler(int sig, ksiginfo_t *info,
 					    const struct k_sigaction *ka,
 					    struct pt_regs *regs, int stepping)
 {
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 39a8a43..1494a3e 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -23,6 +23,27 @@
 		}						\
 	} while (0)
 
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+#define TP_SI_ERRNO(info)	info->si_errno
+#else
+#define TP_SI_ERRNO(info)	info->q ? info->q->info.si_errno : 0
+#endif
+
+#define TP_STORE_KSIGINFO(__entry, info)			\
+	do {							\
+		if (info == (ksiginfo_t *)SEND_SIG_NOINFO ||	\
+		    info == (ksiginfo_t *)SEND_SIG_FORCED) {	\
+			__entry->errno	= 0;			\
+			__entry->code	= SI_USER;		\
+		} else if (info == (ksiginfo_t *)SEND_SIG_PRIV) {\
+			__entry->errno	= 0;			\
+			__entry->code	= SI_KERNEL;		\
+		} else {					\
+			__entry->errno	= TP_SI_ERRNO(info);	\
+			__entry->code	= info->si_code;	\
+		}						\
+	} while (0)
+
 #ifndef TRACE_HEADER_MULTI_READ
 enum {
 	TRACE_SIGNAL_DELIVERED,
@@ -95,7 +116,7 @@ TRACE_EVENT(signal_generate,
  */
 TRACE_EVENT(signal_deliver,
 
-	TP_PROTO(int sig, struct siginfo *info, struct k_sigaction *ka),
+	TP_PROTO(int sig, ksiginfo_t *info, struct k_sigaction *ka),
 
 	TP_ARGS(sig, info, ka),
 
@@ -109,7 +130,7 @@ TRACE_EVENT(signal_deliver,
 
 	TP_fast_assign(
 		__entry->sig	= sig;
-		TP_STORE_SIGINFO(__entry, info);
+		TP_STORE_KSIGINFO(__entry, info);
 		__entry->sa_handler	= (unsigned long)ka->sa.sa_handler;
 		__entry->sa_flags	= ka->sa.sa_flags;
 	),
diff --git a/kernel/compat.c b/kernel/compat.c
index 0a09e48..82084b4 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -579,7 +579,7 @@ COMPAT_SYSCALL_DEFINE5(waitid,
 
 	BUG_ON(info.si_code & __SI_MASK);
 	info.si_code |= __SI_CHLD;
-	return copy_siginfo_to_user32(uinfo, &info);
+	return __copy_siginfo_to_user32(uinfo, &info);
 }
 
 static int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
@@ -981,7 +981,7 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
 	compat_sigset_t s32;
 	sigset_t s;
 	struct timespec t;
-	siginfo_t info;
+	ksiginfo_t info;
 	long ret;
 
 	if (sigsetsize != sizeof(sigset_t))
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1f4bcb3..11223d4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -589,7 +589,7 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
 	return 0;
 }
 
-static int ptrace_getsiginfo(struct task_struct *child, siginfo_t *info)
+static int ptrace_getsiginfo(struct task_struct *child, ksiginfo_t *info)
 {
 	unsigned long flags;
 	int error = -ESRCH;
@@ -605,7 +605,7 @@ static int ptrace_getsiginfo(struct task_struct *child, siginfo_t *info)
 	return error;
 }
 
-static int ptrace_setsiginfo(struct task_struct *child, const siginfo_t *info)
+static int ptrace_setsiginfo(struct task_struct *child, ksiginfo_t *info)
 {
 	unsigned long flags;
 	int error = -ESRCH;
@@ -613,11 +613,14 @@ static int ptrace_setsiginfo(struct task_struct *child, const siginfo_t *info)
 	if (lock_task_sighand(child, &flags)) {
 		error = -EINVAL;
 		if (likely(child->last_siginfo != NULL)) {
+			dismiss_siginfo(child->last_siginfo);
 			*child->last_siginfo = *info;
 			error = 0;
 		}
 		unlock_task_sighand(child, &flags);
 	}
+	if (error)
+		dismiss_siginfo(info);
 	return error;
 }
 
@@ -666,7 +669,7 @@ static int ptrace_peek_siginfo(struct task_struct *child,
 		if (unlikely(is_compat_task())) {
 			compat_siginfo_t __user *uinfo = compat_ptr(data);
 
-			if (copy_siginfo_to_user32(uinfo, &info) ||
+			if (__copy_siginfo_to_user32(uinfo, &info) ||
 			    __put_user(info.si_code, &uinfo->si_code)) {
 				ret = -EFAULT;
 				break;
@@ -677,7 +680,7 @@ static int ptrace_peek_siginfo(struct task_struct *child,
 		{
 			siginfo_t __user *uinfo = (siginfo_t __user *) data;
 
-			if (copy_siginfo_to_user(uinfo, &info) ||
+			if (__copy_siginfo_to_user(uinfo, &info) ||
 			    __put_user(info.si_code, &uinfo->si_code)) {
 				ret = -EFAULT;
 				break;
@@ -805,7 +808,7 @@ int ptrace_request(struct task_struct *child, long request,
 {
 	bool seized = child->ptrace & PT_SEIZED;
 	int ret = -EIO;
-	siginfo_t siginfo, *si;
+	ksiginfo_t siginfo, *si;
 	void __user *datavp = (void __user *) data;
 	unsigned long __user *datalp = datavp;
 	unsigned long flags;
@@ -839,9 +842,8 @@ int ptrace_request(struct task_struct *child, long request,
 		break;
 
 	case PTRACE_SETSIGINFO:
-		if (copy_from_user(&siginfo, datavp, sizeof siginfo))
-			ret = -EFAULT;
-		else
+		ret = copy_ksiginfo_from_user(&siginfo, datavp);
+		if (!ret)
 			ret = ptrace_setsiginfo(child, &siginfo);
 		break;
 
@@ -1107,7 +1109,7 @@ int compat_ptrace_request(struct task_struct *child, compat_long_t request,
 {
 	compat_ulong_t __user *datap = compat_ptr(data);
 	compat_ulong_t word;
-	siginfo_t siginfo;
+	ksiginfo_t siginfo;
 	int ret;
 
 	switch (request) {
@@ -1139,8 +1141,7 @@ int compat_ptrace_request(struct task_struct *child, compat_long_t request,
 		break;
 
 	case PTRACE_SETSIGINFO:
-		memset(&siginfo, 0, sizeof siginfo);
-		if (copy_siginfo_from_user32(
+		if (copy_ksiginfo_from_user32(
 			    &siginfo, (struct compat_siginfo __user *) datap))
 			ret = -EFAULT;
 		else
diff --git a/kernel/signal.c b/kernel/signal.c
index 48cf9fc..2b641aa 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -540,7 +540,18 @@ unblock_all_signals(void)
 	spin_unlock_irqrestore(&current->sighand->siglock, flags);
 }
 
-static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+void dismiss_siginfo(ksiginfo_t *info)
+{
+	if (info->q)
+		__sigqueue_free(info->q);
+	info->q = NULL;
+}
+EXPORT_SYMBOL(dismiss_siginfo);
+#endif
+
+static void collect_signal(int sig, struct sigpending *list,
+			   ksiginfo_t *info)
 {
 	struct sigqueue *q, *first = NULL;
 
@@ -561,8 +572,14 @@ static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
 	if (first) {
 still_pending:
 		list_del_init(&first->list);
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		copy_siginfo(info, &first->info);
 		__sigqueue_free(first);
+#else
+		info->si_signo = sig;
+		info->si_code = first->info.si_code;
+		info->q = first;
+#endif
 	} else {
 		/*
 		 * Ok, it wasn't in the queue.  This must be
@@ -570,15 +587,19 @@ still_pending:
 		 * out of queue space.  So zero out the info.
 		 */
 		info->si_signo = sig;
-		info->si_errno = 0;
 		info->si_code = SI_USER;
 		info->si_pid = 0;
 		info->si_uid = 0;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+		info->si_errno = 0;
+#else
+		info->q = NULL;
+#endif
 	}
 }
 
 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
-			siginfo_t *info)
+			ksiginfo_t *info)
 {
 	int sig = next_signal(pending, mask);
 
@@ -604,7 +625,7 @@ static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
  *
  * All callers have to hold the siglock.
  */
-int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+int dequeue_signal(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info)
 {
 	int signr;
 
@@ -659,7 +680,12 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 		 */
 		current->jobctl |= JOBCTL_STOP_DEQUEUED;
 	}
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 	if ((info->si_code & __SI_MASK) == __SI_TIMER && info->si_sys_private) {
+#else
+	if ((info->si_code & __SI_MASK) == __SI_TIMER &&
+	    info->q->info.si_sys_private) {
+#endif
 		/*
 		 * Release the siglock to ensure proper locking order
 		 * of timer locks outside of siglocks.  Note, we leave
@@ -667,7 +693,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 		 * about to disable them again anyway.
 		 */
 		spin_unlock(&tsk->sighand->siglock);
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		do_schedule_next_timer(info);
+#else
+		do_schedule_next_timer(&info->q->info);
+#endif
 		spin_lock(&tsk->sighand->siglock);
 	}
 	return signr;
@@ -1836,7 +1866,8 @@ static int sigkill_pending(struct task_struct *tsk)
  * If we actually decide not to stop at all because the tracer
  * is gone, we keep current->exit_code unless clear_code.
  */
-static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
+static void ptrace_stop(int exit_code, int why, int clear_code,
+			ksiginfo_t *info)
 	__releases(&current->sighand->siglock)
 	__acquires(&current->sighand->siglock)
 {
@@ -1960,16 +1991,17 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
 
 static void ptrace_do_notify(int signr, int exit_code, int why)
 {
-	siginfo_t info;
+	ksiginfo_t info;
 
-	memset(&info, 0, sizeof info);
 	info.si_signo = signr;
 	info.si_code = exit_code;
 	info.si_pid = task_pid_vnr(current);
 	info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+	info.q = NULL;
 
 	/* Let the debugger run.  */
 	ptrace_stop(exit_code, why, 1, &info);
+	dismiss_siginfo(&info);
 }
 
 void ptrace_notify(int exit_code)
@@ -2141,7 +2173,7 @@ static void do_jobctl_trap(void)
 	}
 }
 
-static int ptrace_signal(int signr, siginfo_t *info)
+static int ptrace_signal(int signr, ksiginfo_t *info)
 {
 	ptrace_signal_deliver();
 	/*
@@ -2170,8 +2202,11 @@ static int ptrace_signal(int signr, siginfo_t *info)
 	 * have updated *info via PTRACE_SETSIGINFO.
 	 */
 	if (signr != info->si_signo) {
+		dismiss_siginfo(info);
 		info->si_signo = signr;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		info->si_errno = 0;
+#endif
 		info->si_code = SI_USER;
 		rcu_read_lock();
 		info->si_pid = task_pid_vnr(current->parent);
@@ -2182,14 +2217,25 @@ static int ptrace_signal(int signr, siginfo_t *info)
 
 	/* If the (new) signal is now blocked, requeue it.  */
 	if (sigismember(&current->blocked, signr)) {
+#ifndef CONFIG_ARCH_USES_KSIGNAL
 		specific_send_sig_info(signr, info, current);
+#else
+		if (!info->q) {
+			struct siginfo s = {.si_code = info->si_code,
+					    .si_signo = signr,
+					    .si_pid = info->si_pid,
+					    .si_uid = info->si_uid};
+			specific_send_sig_info(signr, &s, current);
+		} else
+			specific_send_sig_info(signr, &info->q->info, current);
+#endif
 		signr = 0;
 	}
 
 	return signr;
 }
 
-int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
+int get_signal_to_deliver(ksiginfo_t *info, struct k_sigaction *return_ka,
 			  struct pt_regs *regs, void *cookie)
 {
 	struct sighand_struct *sighand = current->sighand;
@@ -2363,6 +2409,7 @@ relock:
 			 */
 			do_coredump(info);
 		}
+		dismiss_siginfo(info);
 
 		/*
 		 * Death signals, no core dump.
@@ -2377,7 +2424,7 @@ relock:
 /**
  * signal_delivered - 
  * @sig:		number of signal being delivered
- * @info:		siginfo_t of signal being delivered
+ * @info:		ksiginfo_t of signal being delivered
  * @ka:			sigaction setting that chose the handler
  * @regs:		user register state
  * @stepping:		nonzero if debugger single-step or block-step in use
@@ -2387,7 +2434,7 @@ relock:
  * is always blocked, and the signal itself is blocked unless %SA_NODEFER
  * is set in @ka->sa.sa_flags.  Tracing is notified.
  */
-void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka,
+void signal_delivered(int sig, ksiginfo_t *info, struct k_sigaction *ka,
 			struct pt_regs *regs, int stepping)
 {
 	sigset_t blocked;
@@ -2412,6 +2459,7 @@ void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
 	else
 		signal_delivered(ksig->sig, &ksig->info, &ksig->ka,
 			signal_pt_regs(), stepping);
+	dismiss_siginfo(&ksig->info);
 }
 
 /*
@@ -2721,7 +2769,7 @@ COMPAT_SYSCALL_DEFINE2(rt_sigpending, compat_sigset_t __user *, uset,
 }
 #endif
 
-int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
+int __copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
 {
 	int err;
 
@@ -2804,13 +2852,35 @@ int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
 	return err;
 }
 
+int copy_siginfo_to_user(siginfo_t __user *to, const ksiginfo_t *from)
+{
+	const siginfo_t *info;
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+	if (!from->q) {
+		int err;
+		if (!access_ok (VERIFY_WRITE, to, sizeof(siginfo_t)))
+			return -EFAULT;
+		err = __put_user(from->si_signo, &to->si_signo);
+		err |= __put_user(0, &to->si_errno);
+		err |= __put_user((short) from->si_code, &to->si_code);
+		err |= __put_user(from->si_pid, &to->si_pid);
+		err |= __put_user(from->si_uid, &to->si_uid);
+		return err;
+	}
+	info = &from->q->info;
+#else
+	info = from;
+#endif
+	return __copy_siginfo_to_user(to, info);
+}
+
 /**
  *  do_sigtimedwait - wait for queued signals specified in @which
  *  @which: queued signals to wait for
  *  @info: if non-null, the signal's siginfo is returned here
  *  @ts: upper bound on process time suspension
  */
-int do_sigtimedwait(const sigset_t *which, siginfo_t *info,
+int do_sigtimedwait(const sigset_t *which, ksiginfo_t *info,
 			const struct timespec *ts)
 {
 	struct task_struct *tsk = current;
@@ -2878,7 +2948,7 @@ SYSCALL_DEFINE4(rt_sigtimedwait, const sigset_t __user *, uthese,
 {
 	sigset_t these;
 	struct timespec ts;
-	siginfo_t info;
+	ksiginfo_t info;
 	int ret;
 
 	/* XXX: Don't preclude handling different sized sigset_t's.  */
@@ -2999,6 +3069,73 @@ SYSCALL_DEFINE2(tkill, pid_t, pid, int, sig)
 	return do_tkill(0, pid, sig);
 }
 
+int copy_ksiginfo_from_user(ksiginfo_t *to, const siginfo_t __user *from)
+{
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+	return copy_from_user(to, from, sizeof(siginfo_t)) ? -EFAULT : 0;
+#else
+	struct sigqueue *q;
+	if (!access_ok(VERIFY_READ, from, sizeof(siginfo_t)))
+		return -EFAULT;
+	if (__get_user(to->si_code, &from->si_code))
+		return -EFAULT;
+	if (to->si_code == SI_USER) {
+		if (__get_user(to->si_signo, &from->si_signo) ||
+		    __get_user(to->si_uid, &from->si_uid) ||
+		    __get_user(to->si_pid, &from->si_pid))
+			return -EFAULT;
+		to->q = NULL;
+		return 0;
+	}
+	q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
+	if (!q)
+		return -ENOMEM;
+	if (copy_from_user(&q->info, from, sizeof(siginfo_t))) {
+		__sigqueue_free(q);
+		return -EFAULT;
+	}
+	to->q = q;
+	to->si_code = q->info.si_code;
+	to->si_signo = q->info.si_signo;
+	return 0;
+#endif
+}
+
+#ifdef CONFIG_COMPAT
+
+int copy_ksiginfo_from_user32(ksiginfo_t *to, compat_siginfo_t __user *from)
+{
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+	return copy_siginfo_from_user32(to, from);
+#else
+	struct sigqueue *q;
+	if (!access_ok(VERIFY_READ, from, sizeof(compat_siginfo_t)))
+		return -EFAULT;
+	if (__get_user(to->si_code, &from->si_code))
+		return -EFAULT;
+	if (to->si_code == SI_USER) {
+		if (__get_user(to->si_signo, &from->si_signo) ||
+		    __get_user(to->si_uid, &from->si_uid) ||
+		    __get_user(to->si_pid, &from->si_pid))
+			return -EFAULT;
+		to->q = NULL;
+		return 0;
+	}
+	q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
+	if (!q)
+		return -ENOMEM;
+	if (copy_siginfo_from_user32(&q->info, from)) {
+		__sigqueue_free(q);
+		return -EFAULT;
+	}
+	to->q = q;
+	to->si_code = q->info.si_code;
+	to->si_signo = q->info.si_signo;
+	return 0;
+#endif
+}
+#endif
+
 static int do_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t *info)
 {
 	/* Not even root can pretend to send signals from the kernel.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15  5:25                                 ` Al Viro
@ 2014-02-15 14:27                                   ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 14:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On 02/15, Al Viro wrote:
>
> OK, _very_ preliminary patch follows.  It's uglier than it has to

And I'm afraid it needs more uglifications...

> +void dismiss_siginfo(ksiginfo_t *info)
> +{
> +	if (info->q)
> +		__sigqueue_free(info->q);
> +	info->q = NULL;
> +}

1. info->q can be already freed if SIGQUEUE_PREALLOC.

   Once get_signal_to_deliver() or any other caller drops ->siglock
   another thread can do sys_timer_delete()->sigqueue_free().

2. We need to move do_schedule_next_timer() from dequeue_signal()
   here.

   Otherwise ->q can be reused/overwritten by the next send_sigqueue()
   right affter ->siglock is dropped.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 14:27                                   ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 14:27 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On 02/15, Al Viro wrote:
>
> OK, _very_ preliminary patch follows.  It's uglier than it has to

And I'm afraid it needs more uglifications...

> +void dismiss_siginfo(ksiginfo_t *info)
> +{
> +	if (info->q)
> +		__sigqueue_free(info->q);
> +	info->q = NULL;
> +}

1. info->q can be already freed if SIGQUEUE_PREALLOC.

   Once get_signal_to_deliver() or any other caller drops ->siglock
   another thread can do sys_timer_delete()->sigqueue_free().

2. We need to move do_schedule_next_timer() from dequeue_signal()
   here.

   Otherwise ->q can be reused/overwritten by the next send_sigqueue()
   right affter ->siglock is dropped.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 16:19                                 ` Christoph Hellwig
@ 2014-02-15 14:46                                   ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 14:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Linus Torvalds, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On 02/14, Christoph Hellwig wrote:
>
> Might aswell stick the discmiss into what was dequeue_signal_lock().
> Which at that point should get a saner name (maybe thread_dequeue_signal ?)
> and lose all argument except maybe task_struct

No, task_struct argument should die, I think. It is misleading.

spin_lock(tsk->sighand->siglock) is simply wrong unless tsk == current.

And dequeue_signal() assumes that tsk == current too, otherwise
recalc_sigpending() is wrong.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 14:46                                   ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 14:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Sandeen, Linux Kernel, xfs, Al Viro, Dave Jones, Linus Torvalds

On 02/14, Christoph Hellwig wrote:
>
> Might aswell stick the discmiss into what was dequeue_signal_lock().
> Which at that point should get a saner name (maybe thread_dequeue_signal ?)
> and lose all argument except maybe task_struct

No, task_struct argument should die, I think. It is misleading.

spin_lock(tsk->sighand->siglock) is simply wrong unless tsk == current.

And dequeue_signal() assumes that tsk == current too, otherwise
recalc_sigpending() is wrong.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 14:27                                   ` Oleg Nesterov
@ 2014-02-15 15:22                                     ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 15:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:

> 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> 
>    Once get_signal_to_deliver() or any other caller drops ->siglock
>    another thread can do sys_timer_delete()->sigqueue_free().

How the devil would it find the sucker?  It's off the list already.

> 2. We need to move do_schedule_next_timer() from dequeue_signal()
>    here.
> 
>    Otherwise ->q can be reused/overwritten by the next send_sigqueue()
>    right affter ->siglock is dropped.

Ditto.  We rip them out of queue on collect_signal(); the only thing we do
not do is actual __sigqueue_free().

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 15:22                                     ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 15:22 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:

> 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> 
>    Once get_signal_to_deliver() or any other caller drops ->siglock
>    another thread can do sys_timer_delete()->sigqueue_free().

How the devil would it find the sucker?  It's off the list already.

> 2. We need to move do_schedule_next_timer() from dequeue_signal()
>    here.
> 
>    Otherwise ->q can be reused/overwritten by the next send_sigqueue()
>    right affter ->siglock is dropped.

Ditto.  We rip them out of queue on collect_signal(); the only thing we do
not do is actual __sigqueue_free().

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 15:22                                     ` Al Viro
@ 2014-02-15 15:33                                       ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 15:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
>
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> >
> >    Once get_signal_to_deliver() or any other caller drops ->siglock
> >    another thread can do sys_timer_delete()->sigqueue_free().
>
> How the devil would it find the sucker?

It simply frees the SIGQUEUE_PREALLOC sigqueue, k_itimer->sigq.

> It's off the list already.

Exactly, list_empty(q->list) == T. So release_posix_timer()->sigqueue_free()
assumes we can safely free it.

> > 2. We need to move do_schedule_next_timer() from dequeue_signal()
> >    here.
> >
> >    Otherwise ->q can be reused/overwritten by the next send_sigqueue()
> >    right affter ->siglock is dropped.
>
> Ditto.  We rip them out of queue on collect_signal();

Yes, and dequeue_signal()->do_schedule_next_timer() can trigger another
send_sigqueue() which uses the same SIGQUEUE_PREALLOC sigqueue once we
drop ->siglock.

This is not that bad, but at least ->si_overrun can be overwritten
before __setup_rt_frame()->copy_siginfo_to_user().

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 15:33                                       ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 15:33 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
>
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> >
> >    Once get_signal_to_deliver() or any other caller drops ->siglock
> >    another thread can do sys_timer_delete()->sigqueue_free().
>
> How the devil would it find the sucker?

It simply frees the SIGQUEUE_PREALLOC sigqueue, k_itimer->sigq.

> It's off the list already.

Exactly, list_empty(q->list) == T. So release_posix_timer()->sigqueue_free()
assumes we can safely free it.

> > 2. We need to move do_schedule_next_timer() from dequeue_signal()
> >    here.
> >
> >    Otherwise ->q can be reused/overwritten by the next send_sigqueue()
> >    right affter ->siglock is dropped.
>
> Ditto.  We rip them out of queue on collect_signal();

Yes, and dequeue_signal()->do_schedule_next_timer() can trigger another
send_sigqueue() which uses the same SIGQUEUE_PREALLOC sigqueue once we
drop ->siglock.

This is not that bad, but at least ->si_overrun can be overwritten
before __setup_rt_frame()->copy_siginfo_to_user().

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 15:22                                     ` Al Viro
@ 2014-02-15 15:36                                       ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 15:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Sat, Feb 15, 2014 at 03:22:51PM +0000, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
> 
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> > 
> >    Once get_signal_to_deliver() or any other caller drops ->siglock
> >    another thread can do sys_timer_delete()->sigqueue_free().
> 
> How the devil would it find the sucker?  It's off the list already.

Ouch...  I think I see what you mean.  Let me see if I got it right:
timer->sigq is *not* freed by collect_signal(); it's done by
release_posix_timer() instead, under siglock.  Frankly, this
        /*
         * If it is queued it will be freed when dequeued,
         * like the "regular" sigqueue.
         */
        if (!list_empty(&q->list))
                q = NULL;
in sigqueue_free() smells like it's asking for races.  Sigh...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 15:36                                       ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 15:36 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On Sat, Feb 15, 2014 at 03:22:51PM +0000, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
> 
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> > 
> >    Once get_signal_to_deliver() or any other caller drops ->siglock
> >    another thread can do sys_timer_delete()->sigqueue_free().
> 
> How the devil would it find the sucker?  It's off the list already.

Ouch...  I think I see what you mean.  Let me see if I got it right:
timer->sigq is *not* freed by collect_signal(); it's done by
release_posix_timer() instead, under siglock.  Frankly, this
        /*
         * If it is queued it will be freed when dequeued,
         * like the "regular" sigqueue.
         */
        if (!list_empty(&q->list))
                q = NULL;
in sigqueue_free() smells like it's asking for races.  Sigh...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 15:36                                       ` Al Viro
@ 2014-02-15 15:58                                         ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 15:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Sat, Feb 15, 2014 at 03:36:31PM +0000, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:22:51PM +0000, Al Viro wrote:
> > On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
> > 
> > > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> > > 
> > >    Once get_signal_to_deliver() or any other caller drops ->siglock
> > >    another thread can do sys_timer_delete()->sigqueue_free().
> > 
> > How the devil would it find the sucker?  It's off the list already.
> 
> Ouch...  I think I see what you mean.  Let me see if I got it right:
> timer->sigq is *not* freed by collect_signal(); it's done by
> release_posix_timer() instead, under siglock.  Frankly, this
>         /*
>          * If it is queued it will be freed when dequeued,
>          * like the "regular" sigqueue.
>          */
>         if (!list_empty(&q->list))
>                 q = NULL;
> in sigqueue_free() smells like it's asking for races.  Sigh...

So basically we want a different condition for "can we just go ahead and
free that sucker", right?  Instead of "it's on the list, shan't free it"
it ought to be something like "it's on the list or it is referenced by
ksiginfo".  Locking will be interesting, though... ;-/

BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
something blocked, shoves it back with PTRACE_SETSIGINFO and does
PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
with the same ->si_tid, one of them matching the timer->sigq and another
- not?

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 15:58                                         ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 15:58 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On Sat, Feb 15, 2014 at 03:36:31PM +0000, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:22:51PM +0000, Al Viro wrote:
> > On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
> > 
> > > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> > > 
> > >    Once get_signal_to_deliver() or any other caller drops ->siglock
> > >    another thread can do sys_timer_delete()->sigqueue_free().
> > 
> > How the devil would it find the sucker?  It's off the list already.
> 
> Ouch...  I think I see what you mean.  Let me see if I got it right:
> timer->sigq is *not* freed by collect_signal(); it's done by
> release_posix_timer() instead, under siglock.  Frankly, this
>         /*
>          * If it is queued it will be freed when dequeued,
>          * like the "regular" sigqueue.
>          */
>         if (!list_empty(&q->list))
>                 q = NULL;
> in sigqueue_free() smells like it's asking for races.  Sigh...

So basically we want a different condition for "can we just go ahead and
free that sucker", right?  Instead of "it's on the list, shan't free it"
it ought to be something like "it's on the list or it is referenced by
ksiginfo".  Locking will be interesting, though... ;-/

BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
something blocked, shoves it back with PTRACE_SETSIGINFO and does
PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
with the same ->si_tid, one of them matching the timer->sigq and another
- not?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 15:58                                         ` Al Viro
@ 2014-02-15 16:59                                           ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 16:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Sat, Feb 15, 2014 at 03:58:38PM +0000, Al Viro wrote:

> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

I wonder if it would be simpler to use the pointer *only* for si_code < 0
case.  It still gives us ksiginfo_t much smaller than siginfo_t, avoids
all the mess with timers and AFAICS results in less intrusive patch.
IOW, the split between "we know what that sucker is" and "completely
opaque shit the userland has given us".

I'll try to put together something along those lines and see how well does
that work...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 16:59                                           ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 16:59 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On Sat, Feb 15, 2014 at 03:58:38PM +0000, Al Viro wrote:

> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

I wonder if it would be simpler to use the pointer *only* for si_code < 0
case.  It still gives us ksiginfo_t much smaller than siginfo_t, avoids
all the mess with timers and AFAICS results in less intrusive patch.
IOW, the split between "we know what that sucker is" and "completely
opaque shit the userland has given us".

I'll try to put together something along those lines and see how well does
that work...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 15:58                                         ` Al Viro
@ 2014-02-15 17:43                                           ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 17:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On 02/15, Al Viro wrote:
>
> > Ouch...  I think I see what you mean.  Let me see if I got it right:
> > timer->sigq is *not* freed by collect_signal(); it's done by
> > release_posix_timer() instead, under siglock.  Frankly, this
> >         /*
> >          * If it is queued it will be freed when dequeued,
> >          * like the "regular" sigqueue.
> >          */
> >         if (!list_empty(&q->list))
> >                 q = NULL;
> > in sigqueue_free() smells like it's asking for races.  Sigh...

This is protected by ->siglock, should be safe...

> So basically we want a different condition for "can we just go ahead and
> free that sucker", right?  Instead of "it's on the list, shan't free it"
> it ought to be something like "it's on the list or it is referenced by
> ksiginfo".  Locking will be interesting, though... ;-/

I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.


> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

Or the task sends a SI_TIMER info to itself via sys_rt_sigqueueinfo().

Afaics, nothing really bad can happen, I mean the kernel should not
crash or something like this. do_schedule_next_timer() can be fooled,
but at least lock_timer() can only succeed if this process actually
has a timer with the same timer_id. This sigqueue != timer->sigq, but
I think this doesn't matter, posix_timer_event() will use timer->sigq
anyway.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 17:43                                           ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 17:43 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On 02/15, Al Viro wrote:
>
> > Ouch...  I think I see what you mean.  Let me see if I got it right:
> > timer->sigq is *not* freed by collect_signal(); it's done by
> > release_posix_timer() instead, under siglock.  Frankly, this
> >         /*
> >          * If it is queued it will be freed when dequeued,
> >          * like the "regular" sigqueue.
> >          */
> >         if (!list_empty(&q->list))
> >                 q = NULL;
> > in sigqueue_free() smells like it's asking for races.  Sigh...

This is protected by ->siglock, should be safe...

> So basically we want a different condition for "can we just go ahead and
> free that sucker", right?  Instead of "it's on the list, shan't free it"
> it ought to be something like "it's on the list or it is referenced by
> ksiginfo".  Locking will be interesting, though... ;-/

I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.


> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

Or the task sends a SI_TIMER info to itself via sys_rt_sigqueueinfo().

Afaics, nothing really bad can happen, I mean the kernel should not
crash or something like this. do_schedule_next_timer() can be fooled,
but at least lock_timer() can only succeed if this process actually
has a timer with the same timer_id. This sigqueue != timer->sigq, but
I think this doesn't matter, posix_timer_event() will use timer->sigq
anyway.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 17:43                                           ` Oleg Nesterov
@ 2014-02-15 18:05                                             ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > So basically we want a different condition for "can we just go ahead and
> > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > it ought to be something like "it's on the list or it is referenced by
> > ksiginfo".  Locking will be interesting, though... ;-/
> 
> I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.

The trouble being, we might end up with
	Q picked by collect_signal and and stuff into ksiginfo
	Q resubmitted by timer code
	Q picked by *another* thread into its ksiginfo
	the first thread finally done with signal
and at that point we still have a reference in the second thread's ksiginfo.
Hell knows - my first reflex in that kind of situation is to replace
that flag with refcount, so that timer code would hold a reference from
timer_create(2) to timer_delete(2), send_sigqueue() would bump it and
dismiss_siginfo() - drop the sucker.  But that means either grabbing
siglock in dismiss_siginfo() or making the counter atomic; either way
it's a cacheline ping-pong.  Atomic counter is less painful in that respect -
it would be right next to the list, so we dirty that cacheline anyway...

I'm still trying another approach (slightly bigger ksiginfo used to store
all variants with si_code >= 0), but it has messiness of its own; we'll
see how it goes...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 18:05                                             ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-15 18:05 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > So basically we want a different condition for "can we just go ahead and
> > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > it ought to be something like "it's on the list or it is referenced by
> > ksiginfo".  Locking will be interesting, though... ;-/
> 
> I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.

The trouble being, we might end up with
	Q picked by collect_signal and and stuff into ksiginfo
	Q resubmitted by timer code
	Q picked by *another* thread into its ksiginfo
	the first thread finally done with signal
and at that point we still have a reference in the second thread's ksiginfo.
Hell knows - my first reflex in that kind of situation is to replace
that flag with refcount, so that timer code would hold a reference from
timer_create(2) to timer_delete(2), send_sigqueue() would bump it and
dismiss_siginfo() - drop the sucker.  But that means either grabbing
siglock in dismiss_siginfo() or making the counter atomic; either way
it's a cacheline ping-pong.  Atomic counter is less painful in that respect -
it would be right next to the list, so we dirty that cacheline anyway...

I'm still trying another approach (slightly bigger ksiginfo used to store
all variants with si_code >= 0), but it has messiness of its own; we'll
see how it goes...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 18:05                                             ` Al Viro
@ 2014-02-15 18:45                                               ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 18:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > So basically we want a different condition for "can we just go ahead and
> > > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > > it ought to be something like "it's on the list or it is referenced by
> > > ksiginfo".  Locking will be interesting, though... ;-/
> >
> > I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.
>
> The trouble being, we might end up with
> 	Q picked by collect_signal and and stuff into ksiginfo
> 	Q resubmitted by timer code

In this case the timer code should simply inc ->si_overrun and do nothing.

IOW, list_empty() should be turned into is_queued(), and is_queued()
should be true until dismiss_siginfo() which should also do
do_schedule_next_timer(). I think.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 18:45                                               ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-15 18:45 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > So basically we want a different condition for "can we just go ahead and
> > > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > > it ought to be something like "it's on the list or it is referenced by
> > > ksiginfo".  Locking will be interesting, though... ;-/
> >
> > I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.
>
> The trouble being, we might end up with
> 	Q picked by collect_signal and and stuff into ksiginfo
> 	Q resubmitted by timer code

In this case the timer code should simply inc ->si_overrun and do nothing.

IOW, list_empty() should be turned into is_queued(), and is_queued()
should be true until dismiss_siginfo() which should also do
do_schedule_next_timer(). I think.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14 16:01                         ` Dave Jones
@ 2014-02-15 22:23                           ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-15 22:23 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Al Viro, Eric Sandeen, Linux Kernel, xfs

On Fri, Feb 14, 2014 at 11:01:23AM -0500, Dave Jones wrote:
> On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:
>  
>  > > I can fix this one easily - we already have a workqueue for doing
>  > > async log pushes (will split the stack between xlog_cil_force_lsn
>  > > and xlog_cil_push), but the reason we haven't used it for synchronous
>  > > log forces is that screws up fsync performance on CFQ. We don't
>  > > recommend CFQ with XFS anyway, so I think I'll make this change
>  > > anyway.
>  > 
>  > Dave, the patch below should chop off the stack usage from
>  > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
>  > Can you given this a run?
> 
> Looks like it's survived an overnight run..

Great.

One thing that has been puzzling me is why I'm seeing stack usage
reported that is way above what we declare locally on the stack.
e.g.

 29)     2048     224   xfs_da_grow_inode_int+0xbb/0x340
 30)     1824      96   xfs_dir2_grow_inode+0x63/0x110
 31)     1728     208   xfs_dir2_sf_to_block+0xe7/0x5e0
 32)     1520     144   xfs_dir2_sf_addname+0x115/0x5c0
 33)     1376      96   xfs_dir_createname+0x164/0x1a0
 34)     1280     224   xfs_create+0x536/0x660
 35)     1056     128   xfs_vn_mknod+0xc8/0x1d0

I pulled 128 bytes out of xfs_dir_createname by allocating the
xfs_da_args structure, but I still couldn't reconcile the amount of
stack use being reported with the amount used by locally declared
variables (even considering in-lined leaf functions). For example:

Locally
Declared	Used   function
  88             224   xfs_da_grow_inode_int
  32              96   xfs_dir2_grow_inode
 168             208   xfs_dir2_sf_to_block
  56             144   xfs_dir2_sf_addname
  16              96   xfs_dir_createname
 120             224   xfs_create
  52             128   xfs_vn_mknod

There's a pretty massive difference between the actual stack usage
of the local variables and the amount of stack being used by the
compiled code.

What it appears to be is that the compiler is pushing 6-10 registers
to the stack on every function call. So a function that only has 3
local variables and does very little but allocate a structure and
call other functions saves an 6 registers to the stack before it
starts:

Dump of assembler code for function xfs_dir_createname:
214     {
   0xffffffff814d7380 <+0>:     callq  0xffffffff81cf0940
   0xffffffff814d7385 <+5>:     push   %rbp
   0xffffffff814d7386 <+6>:     mov    %rsp,%rbp
   0xffffffff814d7389 <+9>:     sub    $0x50,%rsp
   0xffffffff814d738d <+13>:    mov    %rbx,-0x28(%rbp)
   0xffffffff814d7391 <+17>:    mov    %rdi,%rbx
   0xffffffff814d7394 <+20>:    mov    %r12,-0x20(%rbp)
   0xffffffff814d7398 <+24>:    mov    %rcx,%r12
   0xffffffff814d739b <+27>:    mov    %r13,-0x18(%rbp)
   0xffffffff814d739f <+31>:    mov    %rsi,%r13
   0xffffffff814d73a5 <+37>:    mov    %r14,-0x10(%rbp)
   0xffffffff814d73a9 <+41>:    mov    %rdx,%r14
   0xffffffff814d73ac <+44>:    mov    %r15,-0x8(%rbp)
   0xffffffff814d73b0 <+48>:    mov    %r8,%r15
   0xffffffff814d73b7 <+55>:    mov    %r9,-0x48(%rbp)
.....

If this is typical across the call chain (appears to be from my
quick survey) then we are averaging 40-50 bytes of stack per
call for saved registers. That's a lot of stack space we can't
directly control the usage of, especially when we are talking about
call chains that can get 50+ functions deep...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 22:23                           ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-15 22:23 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Al Viro, Eric Sandeen, Linux Kernel, xfs

On Fri, Feb 14, 2014 at 11:01:23AM -0500, Dave Jones wrote:
> On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:
>  
>  > > I can fix this one easily - we already have a workqueue for doing
>  > > async log pushes (will split the stack between xlog_cil_force_lsn
>  > > and xlog_cil_push), but the reason we haven't used it for synchronous
>  > > log forces is that screws up fsync performance on CFQ. We don't
>  > > recommend CFQ with XFS anyway, so I think I'll make this change
>  > > anyway.
>  > 
>  > Dave, the patch below should chop off the stack usage from
>  > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
>  > Can you given this a run?
> 
> Looks like it's survived an overnight run..

Great.

One thing that has been puzzling me is why I'm seeing stack usage
reported that is way above what we declare locally on the stack.
e.g.

 29)     2048     224   xfs_da_grow_inode_int+0xbb/0x340
 30)     1824      96   xfs_dir2_grow_inode+0x63/0x110
 31)     1728     208   xfs_dir2_sf_to_block+0xe7/0x5e0
 32)     1520     144   xfs_dir2_sf_addname+0x115/0x5c0
 33)     1376      96   xfs_dir_createname+0x164/0x1a0
 34)     1280     224   xfs_create+0x536/0x660
 35)     1056     128   xfs_vn_mknod+0xc8/0x1d0

I pulled 128 bytes out of xfs_dir_createname by allocating the
xfs_da_args structure, but I still couldn't reconcile the amount of
stack use being reported with the amount used by locally declared
variables (even considering in-lined leaf functions). For example:

Locally
Declared	Used   function
  88             224   xfs_da_grow_inode_int
  32              96   xfs_dir2_grow_inode
 168             208   xfs_dir2_sf_to_block
  56             144   xfs_dir2_sf_addname
  16              96   xfs_dir_createname
 120             224   xfs_create
  52             128   xfs_vn_mknod

There's a pretty massive difference between the actual stack usage
of the local variables and the amount of stack being used by the
compiled code.

What it appears to be is that the compiler is pushing 6-10 registers
to the stack on every function call. So a function that only has 3
local variables and does very little but allocate a structure and
call other functions saves an 6 registers to the stack before it
starts:

Dump of assembler code for function xfs_dir_createname:
214     {
   0xffffffff814d7380 <+0>:     callq  0xffffffff81cf0940
   0xffffffff814d7385 <+5>:     push   %rbp
   0xffffffff814d7386 <+6>:     mov    %rsp,%rbp
   0xffffffff814d7389 <+9>:     sub    $0x50,%rsp
   0xffffffff814d738d <+13>:    mov    %rbx,-0x28(%rbp)
   0xffffffff814d7391 <+17>:    mov    %rdi,%rbx
   0xffffffff814d7394 <+20>:    mov    %r12,-0x20(%rbp)
   0xffffffff814d7398 <+24>:    mov    %rcx,%r12
   0xffffffff814d739b <+27>:    mov    %r13,-0x18(%rbp)
   0xffffffff814d739f <+31>:    mov    %rsi,%r13
   0xffffffff814d73a5 <+37>:    mov    %r14,-0x10(%rbp)
   0xffffffff814d73a9 <+41>:    mov    %rdx,%r14
   0xffffffff814d73ac <+44>:    mov    %r15,-0x8(%rbp)
   0xffffffff814d73b0 <+48>:    mov    %r8,%r15
   0xffffffff814d73b7 <+55>:    mov    %r9,-0x48(%rbp)
.....

If this is typical across the call chain (appears to be from my
quick survey) then we are averaging 40-50 bytes of stack per
call for saved registers. That's a lot of stack space we can't
directly control the usage of, especially when we are talking about
call chains that can get 50+ functions deep...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 22:23                           ` Dave Chinner
@ 2014-02-15 22:28                             ` Dave Jones
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-15 22:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, Al Viro, Eric Sandeen, Linux Kernel, xfs

On Sun, Feb 16, 2014 at 09:23:56AM +1100, Dave Chinner wrote:
 
 > There's a pretty massive difference between the actual stack usage
 > of the local variables and the amount of stack being used by the
 > compiled code.
 > 
 > What it appears to be is that the compiler is pushing 6-10 registers
 > to the stack on every function call. So a function that only has 3
 > local variables and does very little but allocate a structure and
 > call other functions saves an 6 registers to the stack before it
 > starts:

I've got a shitload of debug options enabled, which may explain it.
Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

	Dave

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 22:28                             ` Dave Jones
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Jones @ 2014-02-15 22:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs, Linus Torvalds, Al Viro, Linux Kernel

On Sun, Feb 16, 2014 at 09:23:56AM +1100, Dave Chinner wrote:
 
 > There's a pretty massive difference between the actual stack usage
 > of the local variables and the amount of stack being used by the
 > compiled code.
 > 
 > What it appears to be is that the compiler is pushing 6-10 registers
 > to the stack on every function call. So a function that only has 3
 > local variables and does very little but allocate a structure and
 > call other functions saves an 6 registers to the stack before it
 > starts:

I've got a shitload of debug options enabled, which may explain it.
Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

	Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 22:28                             ` Dave Jones
@ 2014-02-15 22:43                               ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-15 22:43 UTC (permalink / raw)
  To: Dave Jones, Dave Chinner, Linus Torvalds, Al Viro, Eric Sandeen,
	Linux Kernel, xfs

On Sat, Feb 15, 2014 at 2:28 PM, Dave Jones <davej@redhat.com> wrote:
>
> I've got a shitload of debug options enabled, which may explain it.
> Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

Well, a lot of it is just the callee-saved registers. The compiler
will tend to preferentially allocate registers in the callee-trashed
registers, but if the function isn't a leaf function, any registers
that are live around a function call will have to be saved somewhere -
either explicitly around the function call, or - more likely - in
callee-saved registers that then get saved in the prologue/epilogue of
the function. And this will happen even in leaf functions when there
is enough register pressure that the callee-trashed registers aren't
sufficient (which is pretty common).

So saving 5-6 registers on the stack (in addition to any actual stack
frame) is pretty much the norm for anything but the very simplest
cases.

But yeah, I'm sure some config options make it worse.
STACK_PROTECTOR_STRONG could easily be one of those.

            Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 22:43                               ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-15 22:43 UTC (permalink / raw)
  To: Dave Jones, Dave Chinner, Linus Torvalds, Al Viro, Eric Sandeen,
	Linux Kernel, xfs

On Sat, Feb 15, 2014 at 2:28 PM, Dave Jones <davej@redhat.com> wrote:
>
> I've got a shitload of debug options enabled, which may explain it.
> Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

Well, a lot of it is just the callee-saved registers. The compiler
will tend to preferentially allocate registers in the callee-trashed
registers, but if the function isn't a leaf function, any registers
that are live around a function call will have to be saved somewhere -
either explicitly around the function call, or - more likely - in
callee-saved registers that then get saved in the prologue/epilogue of
the function. And this will happen even in leaf functions when there
is enough register pressure that the callee-trashed registers aren't
sufficient (which is pretty common).

So saving 5-6 registers on the stack (in addition to any actual stack
frame) is pretty much the norm for anything but the very simplest
cases.

But yeah, I'm sure some config options make it worse.
STACK_PROTECTOR_STRONG could easily be one of those.

            Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-14  0:24                       ` Dave Chinner
  (?)
@ 2014-02-15 23:50                         ` Linus Torvalds
  -1 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-15 23:50 UTC (permalink / raw)
  To: Dave Chinner, linux-mm
  Cc: Al Viro, Dave Jones, Eric Sandeen, Linux Kernel, xfs

[ Added linux-mm to the participants list ]

On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Dave, the patch below should chop off the stack usage from
> xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> Can you given this a run?

Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
people, your parents were some seriously boring people, were they not?
We've got too many Dave's around), but DaveC earlier pointed out that
pretty much any memory allocation path can end up using 3kB of stack
even without XFS being involved.

Which does bring up the question whether we should look (once more) at
the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
reclaim..

Direct reclaim historically used to be an important throttling
mechanism, and I used to not be a fan of trying to avoid direct
reclaim. But the stack depth issue really looks to be pretty bad, and
I think we've gotten better at throttling explicitly, so..

I *think* we already limit filesystem writeback to just kswapd (in
shrink_page_list()), but DaveC posted a backtrace that goes through
do_try_to_free_pages() to shrink_slab(), and through there to the
filesystem and then IO. That looked like a disaster.

And that's because (if I read things right) shrink_page_list() limits
filesystem page writeback to kswapd, but not swap pages. Which I think
probably made more sense back in the days than it does now (I
certainly *hope* that swapping is less important today than it was,
say, ten years ago)

So I'm wondering whether we should remove that page_is_file_cache()
check from shrink_page_list()?

And then there is that whole shrink_slab() case...

               Linus

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 23:50                         ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-15 23:50 UTC (permalink / raw)
  To: Dave Chinner, linux-mm
  Cc: Dave Jones, xfs, Eric Sandeen, Al Viro, Linux Kernel

[ Added linux-mm to the participants list ]

On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Dave, the patch below should chop off the stack usage from
> xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> Can you given this a run?

Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
people, your parents were some seriously boring people, were they not?
We've got too many Dave's around), but DaveC earlier pointed out that
pretty much any memory allocation path can end up using 3kB of stack
even without XFS being involved.

Which does bring up the question whether we should look (once more) at
the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
reclaim..

Direct reclaim historically used to be an important throttling
mechanism, and I used to not be a fan of trying to avoid direct
reclaim. But the stack depth issue really looks to be pretty bad, and
I think we've gotten better at throttling explicitly, so..

I *think* we already limit filesystem writeback to just kswapd (in
shrink_page_list()), but DaveC posted a backtrace that goes through
do_try_to_free_pages() to shrink_slab(), and through there to the
filesystem and then IO. That looked like a disaster.

And that's because (if I read things right) shrink_page_list() limits
filesystem page writeback to kswapd, but not swap pages. Which I think
probably made more sense back in the days than it does now (I
certainly *hope* that swapping is less important today than it was,
say, ten years ago)

So I'm wondering whether we should remove that page_is_file_cache()
check from shrink_page_list()?

And then there is that whole shrink_slab() case...

               Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-15 23:50                         ` Linus Torvalds
  0 siblings, 0 replies; 132+ messages in thread
From: Linus Torvalds @ 2014-02-15 23:50 UTC (permalink / raw)
  To: Dave Chinner, linux-mm
  Cc: Al Viro, Dave Jones, Eric Sandeen, Linux Kernel, xfs

[ Added linux-mm to the participants list ]

On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Dave, the patch below should chop off the stack usage from
> xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> Can you given this a run?

Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
people, your parents were some seriously boring people, were they not?
We've got too many Dave's around), but DaveC earlier pointed out that
pretty much any memory allocation path can end up using 3kB of stack
even without XFS being involved.

Which does bring up the question whether we should look (once more) at
the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
reclaim..

Direct reclaim historically used to be an important throttling
mechanism, and I used to not be a fan of trying to avoid direct
reclaim. But the stack depth issue really looks to be pretty bad, and
I think we've gotten better at throttling explicitly, so..

I *think* we already limit filesystem writeback to just kswapd (in
shrink_page_list()), but DaveC posted a backtrace that goes through
do_try_to_free_pages() to shrink_slab(), and through there to the
filesystem and then IO. That looked like a disaster.

And that's because (if I read things right) shrink_page_list() limits
filesystem page writeback to kswapd, but not swap pages. Which I think
probably made more sense back in the days than it does now (I
certainly *hope* that swapping is less important today than it was,
say, ten years ago)

So I'm wondering whether we should remove that page_is_file_cache()
check from shrink_page_list()?

And then there is that whole shrink_slab() case...

               Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 18:45                                               ` Oleg Nesterov
@ 2014-02-17 16:57                                                 ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-17 16:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On 02/15, Oleg Nesterov wrote:
>
> On 02/15, Al Viro wrote:
> >
> > On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > > So basically we want a different condition for "can we just go ahead and
> > > > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > > > it ought to be something like "it's on the list or it is referenced by
> > > > ksiginfo".  Locking will be interesting, though... ;-/
> > >
> > > I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.
> >
> > The trouble being, we might end up with
> > 	Q picked by collect_signal and and stuff into ksiginfo
> > 	Q resubmitted by timer code
>
> In this case the timer code should simply inc ->si_overrun and do nothing.
>
> IOW, list_empty() should be turned into is_queued(), and is_queued()
> should be true until dismiss_siginfo() which should also do
> do_schedule_next_timer(). I think.

No, this is even more complicated. We also need do_schedule_next_timer()
to calculate ->si_overrun we are going to report, I missed this...

Looks like, this is all is really nasty. Actually, I think siginfo on
stack is not that bad if we are going to do handle_signal() or restart,
perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
Something like below.

Oleg.


diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9e5de68..52f16f9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -684,6 +684,52 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	test_thread_flag(TIF_IA32) ? __NR_ia32_restart_syscall : __NR_restart_syscall
 #endif /* CONFIG_X86_32 */
 
+static noinline int process_signal(struct pt_regs *regs, siginfo_t **pinfo)
+{
+	struct ksignal ksig;
+
+	switch (get_signal(&ksig)) {
+	case SIG_DUMP:
+		*pinfo = kmalloc(sizeof(siginfo_t), GFP_KERNEL);
+		if (*pinfo)
+			copy_siginfo(*pinfo, &ksig.info);
+
+	case SIG_EXIT:
+		return ksig.info.si_signo;
+
+	default:
+		handle_signal(&ksig, regs);
+		break;
+
+	case 0:
+		/* Did we come from a system call? */
+		if (syscall_get_nr(current, regs) >= 0) {
+			/* Restart the system call - no handlers present */
+			switch (syscall_get_error(current, regs)) {
+			case -ERESTARTNOHAND:
+			case -ERESTARTSYS:
+			case -ERESTARTNOINTR:
+				regs->ax = regs->orig_ax;
+				regs->ip -= 2;
+				break;
+
+			case -ERESTART_RESTARTBLOCK:
+				regs->ax = NR_restart_syscall;
+				regs->ip -= 2;
+				break;
+			}
+		}
+
+		/*
+		 * If there's no signal to deliver, we just put the saved sigmask
+		 * back.
+		 */
+		restore_saved_sigmask();
+	}
+
+	return 0;
+}
+
 /*
  * Note that 'init' is a special process: it doesn't get signals it doesn't
  * want to handle. Thus you cannot kill init even with a SIGKILL even by
@@ -691,37 +737,16 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
  */
 static void do_signal(struct pt_regs *regs)
 {
-	struct ksignal ksig;
+	siginfo_t *info = NULL;
+	int sig = process_signal(regs, &info);
 
-	if (get_signal(&ksig)) {
-		/* Whee! Actually deliver the signal.  */
-		handle_signal(&ksig, regs);
-		return;
-	}
-
-	/* Did we come from a system call? */
-	if (syscall_get_nr(current, regs) >= 0) {
-		/* Restart the system call - no handlers present */
-		switch (syscall_get_error(current, regs)) {
-		case -ERESTARTNOHAND:
-		case -ERESTARTSYS:
-		case -ERESTARTNOINTR:
-			regs->ax = regs->orig_ax;
-			regs->ip -= 2;
-			break;
-
-		case -ERESTART_RESTARTBLOCK:
-			regs->ax = NR_restart_syscall;
-			regs->ip -= 2;
-			break;
+	if (sig) {
+		if (info) {
+			do_coredump(info);
+			kfree(info);
 		}
+		do_group_exit(sig);
 	}
-
-	/*
-	 * If there's no signal to deliver, we just put the saved sigmask
-	 * back.
-	 */
-	restore_saved_sigmask();
 }
 
 /*
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 2ac423b..33b5e04 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -285,6 +285,9 @@ struct ksignal {
 	int sig;
 };
 
+#define SIG_EXIT	-1
+#define SIG_DUMP	-2
+
 extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
 extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
 extern void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
@@ -299,7 +302,7 @@ extern void exit_signals(struct task_struct *tsk);
 	struct ksignal *p = (ksig);				\
 	p->sig = get_signal_to_deliver(&p->info, &p->ka,	\
 					signal_pt_regs(), NULL);\
-	p->sig > 0;						\
+	p->sig;							\
 })
 
 extern struct kmem_cache *sighand_cachep;
diff --git a/kernel/signal.c b/kernel/signal.c
index 52f881d..8a4c4a3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2353,22 +2353,11 @@ relock:
 			if (print_fatal_signals)
 				print_fatal_signal(info->si_signo);
 			proc_coredump_connector(current);
-			/*
-			 * If it was able to dump core, this kills all
-			 * other threads in the group and synchronizes with
-			 * their demise.  If we lost the race with another
-			 * thread getting here, it set group_exit_code
-			 * first and our do_group_exit call below will use
-			 * that value and ignore the one we pass it.
-			 */
-			do_coredump(info);
+			return SIG_DUMP;
 		}
 
-		/*
-		 * Death signals, no core dump.
-		 */
-		do_group_exit(info->si_signo);
-		/* NOTREACHED */
+		return SIG_EXIT;
+
 	}
 	spin_unlock_irq(&sighand->siglock);
 	return signr;


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-17 16:57                                                 ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-17 16:57 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On 02/15, Oleg Nesterov wrote:
>
> On 02/15, Al Viro wrote:
> >
> > On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > > So basically we want a different condition for "can we just go ahead and
> > > > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > > > it ought to be something like "it's on the list or it is referenced by
> > > > ksiginfo".  Locking will be interesting, though... ;-/
> > >
> > > I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.
> >
> > The trouble being, we might end up with
> > 	Q picked by collect_signal and and stuff into ksiginfo
> > 	Q resubmitted by timer code
>
> In this case the timer code should simply inc ->si_overrun and do nothing.
>
> IOW, list_empty() should be turned into is_queued(), and is_queued()
> should be true until dismiss_siginfo() which should also do
> do_schedule_next_timer(). I think.

No, this is even more complicated. We also need do_schedule_next_timer()
to calculate ->si_overrun we are going to report, I missed this...

Looks like, this is all is really nasty. Actually, I think siginfo on
stack is not that bad if we are going to do handle_signal() or restart,
perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
Something like below.

Oleg.


diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9e5de68..52f16f9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -684,6 +684,52 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	test_thread_flag(TIF_IA32) ? __NR_ia32_restart_syscall : __NR_restart_syscall
 #endif /* CONFIG_X86_32 */
 
+static noinline int process_signal(struct pt_regs *regs, siginfo_t **pinfo)
+{
+	struct ksignal ksig;
+
+	switch (get_signal(&ksig)) {
+	case SIG_DUMP:
+		*pinfo = kmalloc(sizeof(siginfo_t), GFP_KERNEL);
+		if (*pinfo)
+			copy_siginfo(*pinfo, &ksig.info);
+
+	case SIG_EXIT:
+		return ksig.info.si_signo;
+
+	default:
+		handle_signal(&ksig, regs);
+		break;
+
+	case 0:
+		/* Did we come from a system call? */
+		if (syscall_get_nr(current, regs) >= 0) {
+			/* Restart the system call - no handlers present */
+			switch (syscall_get_error(current, regs)) {
+			case -ERESTARTNOHAND:
+			case -ERESTARTSYS:
+			case -ERESTARTNOINTR:
+				regs->ax = regs->orig_ax;
+				regs->ip -= 2;
+				break;
+
+			case -ERESTART_RESTARTBLOCK:
+				regs->ax = NR_restart_syscall;
+				regs->ip -= 2;
+				break;
+			}
+		}
+
+		/*
+		 * If there's no signal to deliver, we just put the saved sigmask
+		 * back.
+		 */
+		restore_saved_sigmask();
+	}
+
+	return 0;
+}
+
 /*
  * Note that 'init' is a special process: it doesn't get signals it doesn't
  * want to handle. Thus you cannot kill init even with a SIGKILL even by
@@ -691,37 +737,16 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
  */
 static void do_signal(struct pt_regs *regs)
 {
-	struct ksignal ksig;
+	siginfo_t *info = NULL;
+	int sig = process_signal(regs, &info);
 
-	if (get_signal(&ksig)) {
-		/* Whee! Actually deliver the signal.  */
-		handle_signal(&ksig, regs);
-		return;
-	}
-
-	/* Did we come from a system call? */
-	if (syscall_get_nr(current, regs) >= 0) {
-		/* Restart the system call - no handlers present */
-		switch (syscall_get_error(current, regs)) {
-		case -ERESTARTNOHAND:
-		case -ERESTARTSYS:
-		case -ERESTARTNOINTR:
-			regs->ax = regs->orig_ax;
-			regs->ip -= 2;
-			break;
-
-		case -ERESTART_RESTARTBLOCK:
-			regs->ax = NR_restart_syscall;
-			regs->ip -= 2;
-			break;
+	if (sig) {
+		if (info) {
+			do_coredump(info);
+			kfree(info);
 		}
+		do_group_exit(sig);
 	}
-
-	/*
-	 * If there's no signal to deliver, we just put the saved sigmask
-	 * back.
-	 */
-	restore_saved_sigmask();
 }
 
 /*
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 2ac423b..33b5e04 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -285,6 +285,9 @@ struct ksignal {
 	int sig;
 };
 
+#define SIG_EXIT	-1
+#define SIG_DUMP	-2
+
 extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
 extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
 extern void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
@@ -299,7 +302,7 @@ extern void exit_signals(struct task_struct *tsk);
 	struct ksignal *p = (ksig);				\
 	p->sig = get_signal_to_deliver(&p->info, &p->ka,	\
 					signal_pt_regs(), NULL);\
-	p->sig > 0;						\
+	p->sig;							\
 })
 
 extern struct kmem_cache *sighand_cachep;
diff --git a/kernel/signal.c b/kernel/signal.c
index 52f881d..8a4c4a3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2353,22 +2353,11 @@ relock:
 			if (print_fatal_signals)
 				print_fatal_signal(info->si_signo);
 			proc_coredump_connector(current);
-			/*
-			 * If it was able to dump core, this kills all
-			 * other threads in the group and synchronizes with
-			 * their demise.  If we lost the race with another
-			 * thread getting here, it set group_exit_code
-			 * first and our do_group_exit call below will use
-			 * that value and ignore the one we pass it.
-			 */
-			do_coredump(info);
+			return SIG_DUMP;
 		}
 
-		/*
-		 * Death signals, no core dump.
-		 */
-		do_group_exit(info->si_signo);
-		/* NOTREACHED */
+		return SIG_EXIT;
+
 	}
 	spin_unlock_irq(&sighand->siglock);
 	return signr;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-17 16:57                                                 ` Oleg Nesterov
@ 2014-02-17 17:40                                                   ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-17 17:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:

> Looks like, this is all is really nasty. Actually, I think siginfo on
> stack is not that bad if we are going to do handle_signal() or restart,
> perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> Something like below.

Yecchhhh...  You've just broken every architecture other than x86, and to
fix them you'll need to massage every get_signal()/get_signal_to_deliver()
user out there, pulling the logics *out* of kernel/signal.c and into arch/*.
This is just plain wrong.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-17 17:40                                                   ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-17 17:40 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:

> Looks like, this is all is really nasty. Actually, I think siginfo on
> stack is not that bad if we are going to do handle_signal() or restart,
> perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> Something like below.

Yecchhhh...  You've just broken every architecture other than x86, and to
fix them you'll need to massage every get_signal()/get_signal_to_deliver()
user out there, pulling the logics *out* of kernel/signal.c and into arch/*.
This is just plain wrong.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-17 17:40                                                   ` Al Viro
@ 2014-02-17 17:46                                                     ` Oleg Nesterov
  -1 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-17 17:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On 02/17, Al Viro wrote:
>
> On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:
>
> > Looks like, this is all is really nasty. Actually, I think siginfo on
> > stack is not that bad if we are going to do handle_signal() or restart,
> > perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> > Something like below.
>
> Yecchhhh...  You've just broken every architecture other than x86,

Of course, this is only to explain what I meant.

> and to
> fix them you'll need to massage every get_signal()/get_signal_to_deliver()
> user out there,

Yes.

> pulling the logics *out* of kernel/signal.c and into arch/*.

Not really, I think. Of course this change should be cleanuped. And it
should not require to change all architectures at once.

> This is just plain wrong.

I agree, this change is also ugly.

Oleg.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-17 17:46                                                     ` Oleg Nesterov
  0 siblings, 0 replies; 132+ messages in thread
From: Oleg Nesterov @ 2014-02-17 17:46 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On 02/17, Al Viro wrote:
>
> On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:
>
> > Looks like, this is all is really nasty. Actually, I think siginfo on
> > stack is not that bad if we are going to do handle_signal() or restart,
> > perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> > Something like below.
>
> Yecchhhh...  You've just broken every architecture other than x86,

Of course, this is only to explain what I meant.

> and to
> fix them you'll need to massage every get_signal()/get_signal_to_deliver()
> user out there,

Yes.

> pulling the logics *out* of kernel/signal.c and into arch/*.

Not really, I think. Of course this change should be cleanuped. And it
should not require to change all architectures at once.

> This is just plain wrong.

I agree, this change is also ugly.

Oleg.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-17 17:46                                                     ` Oleg Nesterov
@ 2014-02-17 17:54                                                       ` Al Viro
  -1 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-17 17:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Chinner, Dave Jones, Eric Sandeen,
	Linux Kernel, xfs

On Mon, Feb 17, 2014 at 06:46:48PM +0100, Oleg Nesterov wrote:
> > pulling the logics *out* of kernel/signal.c and into arch/*.
> 
> Not really, I think.

How so?  You propose to make all architectures call do_coredump()
themselves, instead of having it done centrally.  It's more boilerplate
for all of them to get wrong.  And IME any boilerplate in that are
*will* be gotten wrong - I've gone through signal handling on all
architectures more than a few times and every time it caught a pile
of bugs...

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-17 17:54                                                       ` Al Viro
  0 siblings, 0 replies; 132+ messages in thread
From: Al Viro @ 2014-02-17 17:54 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Eric Sandeen, Linux Kernel, xfs, Dave Jones, Linus Torvalds

On Mon, Feb 17, 2014 at 06:46:48PM +0100, Oleg Nesterov wrote:
> > pulling the logics *out* of kernel/signal.c and into arch/*.
> 
> Not really, I think.

How so?  You propose to make all architectures call do_coredump()
themselves, instead of having it done centrally.  It's more boilerplate
for all of them to get wrong.  And IME any boilerplate in that are
*will* be gotten wrong - I've gone through signal handling on all
architectures more than a few times and every time it caught a pile
of bugs...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
  2014-02-15 23:50                         ` Linus Torvalds
  (?)
@ 2014-02-18  1:27                           ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-18  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Al Viro, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Sat, Feb 15, 2014 at 03:50:29PM -0800, Linus Torvalds wrote:
> [ Added linux-mm to the participants list ]
> 
> On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Dave, the patch below should chop off the stack usage from
> > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> > Can you given this a run?
> 
> Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
> people, your parents were some seriously boring people, were they not?
> We've got too many Dave's around),

It's an exclusive club - we have 'kernel hacker Dave' reunions in
bars around the world. We should get some tshirts made up....  :)

> but DaveC earlier pointed out that
> pretty much any memory allocation path can end up using 3kB of stack
> even without XFS being involved.
> 
> Which does bring up the question whether we should look (once more) at
> the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
> reclaim..

We do that mostly already, but GFP_KERNEL allows swap IO and that's
where the deepest stack I saw came from.

Even if we don't allow IO at all, we're still going to see stack
usage of 2-2.5k in direct reclaim. e.g.  invalidate a page and enter
the rmap code.  The rmap is protected by a mutex, so if we fail to
get that we have about 1.2k of stack consumed from there and that is
on top of the the allocator/reclaim that has already consumes ~1k of
stack...

> Direct reclaim historically used to be an important throttling
> mechanism, and I used to not be a fan of trying to avoid direct
> reclaim. But the stack depth issue really looks to be pretty bad, and
> I think we've gotten better at throttling explicitly, so..
> 
> I *think* we already limit filesystem writeback to just kswapd (in
> shrink_page_list()), but DaveC posted a backtrace that goes through
> do_try_to_free_pages() to shrink_slab(), and through there to the
> filesystem and then IO. That looked like a disaster.

Right, that's an XFS problem, and I'm working on fixing it. The
Patch I sent to DaveJ fixes the worst case, but I need to make it
completely IO-less while still retaining the throttling the IO gives
us.

> And that's because (if I read things right) shrink_page_list() limits
> filesystem page writeback to kswapd, but not swap pages. Which I think
> probably made more sense back in the days than it does now (I
> certainly *hope* that swapping is less important today than it was,
> say, ten years ago)
> 
> So I'm wondering whether we should remove that page_is_file_cache()
> check from shrink_page_list()?

The thing is, the stack usage from the swap IO path is pretty well
bound - it's just the worst case stack of issuing IO. We know it
won't recurse into direct reclaim, so mempool allocation and
blocking is all we need to consider. Compare that to a filesystem
which may need to allocate extents and hence do transactions and
split btrees and read metadata and allocate large amounts of memory
even before it gets to the IO layers.

Hence I suspect that we could do a simple thing like only allow swap
if there's more than half the stack available in the current reclaim
context. Because, let's face it, if the submit_bio path is consuming
more than half the available stack then we're totally screwed from a
filesystem perspective....

> And then there is that whole shrink_slab() case...

I think with shrinkers we just need to be more careful. The XFS
behaviour is all my fault, and I should have known better that to
design code that requires IO in the direct reclaim path. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-18  1:27                           ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-18  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Sandeen, Linux Kernel, xfs, linux-mm, Al Viro, Dave Jones

On Sat, Feb 15, 2014 at 03:50:29PM -0800, Linus Torvalds wrote:
> [ Added linux-mm to the participants list ]
> 
> On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Dave, the patch below should chop off the stack usage from
> > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> > Can you given this a run?
> 
> Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
> people, your parents were some seriously boring people, were they not?
> We've got too many Dave's around),

It's an exclusive club - we have 'kernel hacker Dave' reunions in
bars around the world. We should get some tshirts made up....  :)

> but DaveC earlier pointed out that
> pretty much any memory allocation path can end up using 3kB of stack
> even without XFS being involved.
> 
> Which does bring up the question whether we should look (once more) at
> the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
> reclaim..

We do that mostly already, but GFP_KERNEL allows swap IO and that's
where the deepest stack I saw came from.

Even if we don't allow IO at all, we're still going to see stack
usage of 2-2.5k in direct reclaim. e.g.  invalidate a page and enter
the rmap code.  The rmap is protected by a mutex, so if we fail to
get that we have about 1.2k of stack consumed from there and that is
on top of the the allocator/reclaim that has already consumes ~1k of
stack...

> Direct reclaim historically used to be an important throttling
> mechanism, and I used to not be a fan of trying to avoid direct
> reclaim. But the stack depth issue really looks to be pretty bad, and
> I think we've gotten better at throttling explicitly, so..
> 
> I *think* we already limit filesystem writeback to just kswapd (in
> shrink_page_list()), but DaveC posted a backtrace that goes through
> do_try_to_free_pages() to shrink_slab(), and through there to the
> filesystem and then IO. That looked like a disaster.

Right, that's an XFS problem, and I'm working on fixing it. The
Patch I sent to DaveJ fixes the worst case, but I need to make it
completely IO-less while still retaining the throttling the IO gives
us.

> And that's because (if I read things right) shrink_page_list() limits
> filesystem page writeback to kswapd, but not swap pages. Which I think
> probably made more sense back in the days than it does now (I
> certainly *hope* that swapping is less important today than it was,
> say, ten years ago)
> 
> So I'm wondering whether we should remove that page_is_file_cache()
> check from shrink_page_list()?

The thing is, the stack usage from the swap IO path is pretty well
bound - it's just the worst case stack of issuing IO. We know it
won't recurse into direct reclaim, so mempool allocation and
blocking is all we need to consider. Compare that to a filesystem
which may need to allocate extents and hence do transactions and
split btrees and read metadata and allocate large amounts of memory
even before it gets to the IO layers.

Hence I suspect that we could do a simple thing like only allow swap
if there's more than half the stack available in the current reclaim
context. Because, let's face it, if the submit_bio path is consuming
more than half the available stack then we're totally screwed from a
filesystem perspective....

> And then there is that whole shrink_slab() case...

I think with shrinkers we just need to be more careful. The XFS
behaviour is all my fault, and I should have known better that to
design code that requires IO in the direct reclaim path. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: 3.14-rc2 XFS backtrace because irqs_disabled.
@ 2014-02-18  1:27                           ` Dave Chinner
  0 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2014-02-18  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Al Viro, Dave Jones, Eric Sandeen, Linux Kernel, xfs

On Sat, Feb 15, 2014 at 03:50:29PM -0800, Linus Torvalds wrote:
> [ Added linux-mm to the participants list ]
> 
> On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Dave, the patch below should chop off the stack usage from
> > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> > Can you given this a run?
> 
> Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
> people, your parents were some seriously boring people, were they not?
> We've got too many Dave's around),

It's an exclusive club - we have 'kernel hacker Dave' reunions in
bars around the world. We should get some tshirts made up....  :)

> but DaveC earlier pointed out that
> pretty much any memory allocation path can end up using 3kB of stack
> even without XFS being involved.
> 
> Which does bring up the question whether we should look (once more) at
> the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
> reclaim..

We do that mostly already, but GFP_KERNEL allows swap IO and that's
where the deepest stack I saw came from.

Even if we don't allow IO at all, we're still going to see stack
usage of 2-2.5k in direct reclaim. e.g.  invalidate a page and enter
the rmap code.  The rmap is protected by a mutex, so if we fail to
get that we have about 1.2k of stack consumed from there and that is
on top of the the allocator/reclaim that has already consumes ~1k of
stack...

> Direct reclaim historically used to be an important throttling
> mechanism, and I used to not be a fan of trying to avoid direct
> reclaim. But the stack depth issue really looks to be pretty bad, and
> I think we've gotten better at throttling explicitly, so..
> 
> I *think* we already limit filesystem writeback to just kswapd (in
> shrink_page_list()), but DaveC posted a backtrace that goes through
> do_try_to_free_pages() to shrink_slab(), and through there to the
> filesystem and then IO. That looked like a disaster.

Right, that's an XFS problem, and I'm working on fixing it. The
Patch I sent to DaveJ fixes the worst case, but I need to make it
completely IO-less while still retaining the throttling the IO gives
us.

> And that's because (if I read things right) shrink_page_list() limits
> filesystem page writeback to kswapd, but not swap pages. Which I think
> probably made more sense back in the days than it does now (I
> certainly *hope* that swapping is less important today than it was,
> say, ten years ago)
> 
> So I'm wondering whether we should remove that page_is_file_cache()
> check from shrink_page_list()?

The thing is, the stack usage from the swap IO path is pretty well
bound - it's just the worst case stack of issuing IO. We know it
won't recurse into direct reclaim, so mempool allocation and
blocking is all we need to consider. Compare that to a filesystem
which may need to allocate extents and hence do transactions and
split btrees and read metadata and allocate large amounts of memory
even before it gets to the IO layers.

Hence I suspect that we could do a simple thing like only allow swap
if there's more than half the stack available in the current reclaim
context. Because, let's face it, if the submit_bio path is consuming
more than half the available stack then we're totally screwed from a
filesystem perspective....

> And then there is that whole shrink_slab() case...

I think with shrinkers we just need to be more careful. The XFS
behaviour is all my fault, and I should have known better that to
design code that requires IO in the direct reclaim path. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 132+ messages in thread

end of thread, other threads:[~2014-02-18  1:28 UTC | newest]

Thread overview: 132+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-11 17:27 3.14-rc2 XFS backtrace because irqs_disabled Dave Jones
2014-02-11 17:27 ` Dave Jones
2014-02-11 21:08 ` Dave Chinner
2014-02-11 21:08   ` Dave Chinner
2014-02-11 21:49   ` Eric Sandeen
2014-02-11 21:49     ` Eric Sandeen
2014-02-12  0:44     ` Dave Jones
2014-02-12  0:44       ` Dave Jones
2014-02-12  1:09       ` Al Viro
2014-02-12  1:09         ` Al Viro
2014-02-12  2:52         ` Linus Torvalds
2014-02-12  2:52           ` Linus Torvalds
2014-02-12  4:03           ` Dave Jones
2014-02-12  4:03             ` Dave Jones
2014-02-12  4:22             ` Al Viro
2014-02-12  4:22               ` Al Viro
2014-02-12  5:40               ` Dave Chinner
2014-02-12  5:40                 ` Dave Chinner
2014-02-12  5:50                 ` Dave Jones
2014-02-12  5:50                   ` Dave Jones
2014-02-12  6:10                   ` Dave Chinner
2014-02-12  6:10                     ` Dave Chinner
2014-02-12  6:31                     ` Dave Chinner
2014-02-12  6:31                       ` Dave Chinner
2014-02-12  6:59                       ` Linus Torvalds
2014-02-12  6:59                         ` Linus Torvalds
2014-02-12  8:13                         ` Tejun Heo
2014-02-12  8:13                           ` Tejun Heo
2014-02-12 12:44                           ` Steven Rostedt
2014-02-12 12:44                             ` Steven Rostedt
2014-02-12  8:35                         ` Dave Chinner
2014-02-12  8:35                           ` Dave Chinner
2014-02-12 12:50                           ` Steven Rostedt
2014-02-12 12:50                             ` Steven Rostedt
2014-02-12 12:40                         ` Steven Rostedt
2014-02-12 12:40                           ` Steven Rostedt
2014-02-12 13:29                           ` Peter Zijlstra
2014-02-12 13:29                             ` Peter Zijlstra
2014-02-12 14:25                     ` Dave Jones
2014-02-12 14:25                       ` Dave Jones
2014-02-12 21:14                       ` Dave Chinner
2014-02-12 21:14                         ` Dave Chinner
2014-02-12 15:57                     ` Eric Sandeen
2014-02-12 15:57                       ` Eric Sandeen
2014-02-12  6:28                 ` Linus Torvalds
2014-02-12  6:28                   ` Linus Torvalds
2014-02-12  7:18                   ` Dave Chinner
2014-02-12  7:18                     ` Dave Chinner
2014-02-14  0:24                     ` Dave Chinner
2014-02-14  0:24                       ` Dave Chinner
2014-02-14 16:01                       ` Dave Jones
2014-02-14 16:01                         ` Dave Jones
2014-02-15 22:23                         ` Dave Chinner
2014-02-15 22:23                           ` Dave Chinner
2014-02-15 22:28                           ` Dave Jones
2014-02-15 22:28                             ` Dave Jones
2014-02-15 22:43                             ` Linus Torvalds
2014-02-15 22:43                               ` Linus Torvalds
2014-02-15 23:50                       ` Linus Torvalds
2014-02-15 23:50                         ` Linus Torvalds
2014-02-15 23:50                         ` Linus Torvalds
2014-02-18  1:27                         ` Dave Chinner
2014-02-18  1:27                           ` Dave Chinner
2014-02-18  1:27                           ` Dave Chinner
2014-02-12 11:39                   ` Al Viro
2014-02-12 11:39                     ` Al Viro
2014-02-12 20:13                     ` Linus Torvalds
2014-02-12 20:13                       ` Linus Torvalds
2014-02-12 21:14                       ` Al Viro
2014-02-12 21:14                         ` Al Viro
2014-02-12 21:32                         ` Linus Torvalds
2014-02-12 21:32                           ` Linus Torvalds
2014-02-12 21:44                           ` Al Viro
2014-02-12 21:44                             ` Al Viro
2014-02-13 20:51                             ` Al Viro
2014-02-13 20:51                               ` Al Viro
2014-02-14  0:09                               ` Al Viro
2014-02-14  0:09                                 ` Al Viro
2014-02-14 13:25                               ` Christoph Hellwig
2014-02-14 13:25                                 ` Christoph Hellwig
2014-02-14 13:29                                 ` Richard Weinberger
2014-02-14 13:29                                   ` Richard Weinberger
2014-02-14 15:20                                   ` Al Viro
2014-02-14 15:20                                     ` Al Viro
2014-02-14 16:08                                     ` Oleg Nesterov
2014-02-14 16:08                                       ` Oleg Nesterov
2014-02-13 17:40                           ` Oleg Nesterov
2014-02-13 17:40                             ` Oleg Nesterov
2014-02-13 17:58                             ` Linus Torvalds
2014-02-13 17:58                               ` Linus Torvalds
2014-02-13 18:10                               ` Oleg Nesterov
2014-02-13 18:10                                 ` Oleg Nesterov
2014-02-13 18:37                                 ` Oleg Nesterov
2014-02-13 18:37                                   ` Oleg Nesterov
2014-02-15  5:25                               ` Al Viro
2014-02-15  5:25                                 ` Al Viro
2014-02-15 14:27                                 ` Oleg Nesterov
2014-02-15 14:27                                   ` Oleg Nesterov
2014-02-15 15:22                                   ` Al Viro
2014-02-15 15:22                                     ` Al Viro
2014-02-15 15:33                                     ` Oleg Nesterov
2014-02-15 15:33                                       ` Oleg Nesterov
2014-02-15 15:36                                     ` Al Viro
2014-02-15 15:36                                       ` Al Viro
2014-02-15 15:58                                       ` Al Viro
2014-02-15 15:58                                         ` Al Viro
2014-02-15 16:59                                         ` Al Viro
2014-02-15 16:59                                           ` Al Viro
2014-02-15 17:43                                         ` Oleg Nesterov
2014-02-15 17:43                                           ` Oleg Nesterov
2014-02-15 18:05                                           ` Al Viro
2014-02-15 18:05                                             ` Al Viro
2014-02-15 18:45                                             ` Oleg Nesterov
2014-02-15 18:45                                               ` Oleg Nesterov
2014-02-17 16:57                                               ` Oleg Nesterov
2014-02-17 16:57                                                 ` Oleg Nesterov
2014-02-17 17:40                                                 ` Al Viro
2014-02-17 17:40                                                   ` Al Viro
2014-02-17 17:46                                                   ` Oleg Nesterov
2014-02-17 17:46                                                     ` Oleg Nesterov
2014-02-17 17:54                                                     ` Al Viro
2014-02-17 17:54                                                       ` Al Viro
2014-02-14 16:13                           ` Christoph Hellwig
2014-02-14 16:13                             ` Christoph Hellwig
2014-02-14 16:16                             ` Al Viro
2014-02-14 16:16                               ` Al Viro
2014-02-14 16:18                               ` Al Viro
2014-02-14 16:18                                 ` Al Viro
2014-02-14 16:19                               ` Christoph Hellwig
2014-02-14 16:19                                 ` Christoph Hellwig
2014-02-15 14:46                                 ` Oleg Nesterov
2014-02-15 14:46                                   ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.