* 2.6.25-git1: Solid hang on HP nx6325 (64-bit)
@ 2008-04-19 13:22 Rafael J. Wysocki
  2008-04-20 19:04 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Rafael J. Wysocki
  0 siblings, 1 reply; 183+ messages in thread
From: Rafael J. Wysocki @ 2008-04-19 13:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, Linus Torvalds, LKML

Hi,

2.6.25-git1 hung solid on my HP nx6325 when I was trying to run KMail. That
happened after running it for several hours with a couple of suspend and
hibernation cycles in between, but I hadn't been doing anything unusual with
it.

Well, that has never happened since 2.6.25-rc1 (at least) on this box, so it
looks worrisome. I guess one of the x86 changes is responsible for it.

Thanks,
Rafael

-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 183+ messages in thread

* 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-19 13:22 2.6.25-git1: Solid hang on HP nx6325 (64-bit) Rafael J. Wysocki @ 2008-04-20 19:04 ` Rafael J. Wysocki 2008-04-20 19:14 ` Rafael J. Wysocki ` (2 more replies) 0 siblings, 3 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-20 19:04 UTC (permalink / raw) To: LKML; +Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, linux-ext4 Hi, I've just got the following traces from 2.6.25-git2 on HP nx6325 (64-bit). I think they are related to the hang I described yesterday: [12844.066757] BUG: unable to handle kernel paging request at ffffffffffffffff [12844.066765] IP: [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.066775] PGD 203067 PUD 204067 PMD 0 [12844.066778] Oops: 0000 [1] SMP DEBUG_PAGEALLOC [12844.066782] CPU 1 [12844.066784] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet rfkill_input snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter cpufreq_conservative ip6_tables x_tables cpufreq_ondemand cpufreq_userspace ipv6 cpufreq_powersave powernow_k8 freq_table fuse dm_crypt loop dm_mod arc4 ecb crypto_blkcipher b43 rfkill mac80211 cfg80211 led_class rfcomm input_polldev l2cap fan ssb thermal pcmcia joydev snd_hda_intel snd_pcm rtc_cmos yenta_socket usbhid rtc_core hci_usb processor rsrc_nonstatic snd_timer shpchp psmouse i2c_piix4 sdhci ohci1394 battery pcmcia_core snd_page_alloc snd_hwdep tifm_7xx1 pci_hotplug serio_raw ide_cd_mod ac button i2c_core backlight output ieee1394 tifm_core mmc_core rtc_lib ff_memless bluetooth snd soundcore firmware_class k8temp cdrom tg3 sg ohci_hcd ehci_hcd usbcore edd ext3 jbd atiixp ide_core [12844.066854] Pid: 13078, comm: kio_file Tainted: G M 2.6.25 #401 [12844.066857] RIP: 0010:[<ffffffff802a7b3c>] [<ffffffff802a7b3c>] 
__d_lookup+0xf1/0x117 [12844.066861] RSP: 0018:ffff810064c5dc08 EFLAGS: 00010286 [12844.066863] RAX: ffffffffffffffff RBX: ffff8100f0bd7e10 RCX: 0000000000000012 [12844.066866] RDX: ffffffffffffffff RSI: ffff810064c5dd08 RDI: ffff810053304000 [12844.066868] RBP: ffff810064c5dc58 R08: 0000000000000003 R09: 0000000000000001 [12844.066871] R10: 0000000000000000 R11: 0000000000000246 R12: ffff810053304000 [12844.066873] R13: ffff810064c5dd08 R14: 000000005b3d8b1c R15: 000000000000001a [12844.066876] FS: 00007f08e0719700(0000) GS:ffff81007782d480(0000) knlGS:0000000000000000 [12844.066879] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [12844.066881] CR2: ffffffffffffffff CR3: 000000006a4f2000 CR4: 00000000000006a0 [12844.066884] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [12844.066886] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [12844.066889] Process kio_file (pid: 13078, threadinfo ffff810064c5c000, task ffff81005a8c8000) [12844.066891] Stack: ffff81000cdde000 000000000000001a ffff8100504a3000 000000000e310f76 [12844.066897] ffffffffffffffff ffff810068c941c0 ffff810064c5de38 ffff8100533050c8 [12844.066901] 0000000000000000 ffff810064c5de38 ffff810064c5dca8 ffffffff8029e236 [12844.066905] Call Trace: [12844.066919] [<ffffffff8029e236>] do_lookup+0x2c/0x1b2 [12844.066930] [<ffffffff802a04b4>] __link_path_walk+0x8e6/0xdbd [12844.066955] [<ffffffffa004deb4>] ? :ext3:ext3_xattr_get_acl_default+0x18/0x1a [12844.066961] [<ffffffff802b0869>] ? generic_getxattr+0x4e/0x5c [12844.066973] [<ffffffff802a09ec>] path_walk+0x61/0xc3 [12844.066981] [<ffffffff802a0cd2>] do_path_lookup+0x15d/0x1d9 [12844.066991] [<ffffffff802a161a>] __user_walk_fd+0x41/0x5c [12844.067000] [<ffffffff8029a252>] vfs_lstat_fd+0x24/0x5a [12844.067007] [<ffffffff8030b30d>] ? _atomic_dec_and_lock+0x3d/0x5c [12844.067013] [<ffffffff802abe02>] ? mntput_no_expire+0x20/0x8b [12844.067019] [<ffffffff8029dfe8>] ? path_put+0x2c/0x30 [12844.067021] [<ffffffff802b128d>] ? 
sys_getxattr+0x60/0x75 [12844.067021] [<ffffffff8029a2aa>] sys_newlstat+0x22/0x3c [12844.067021] [<ffffffff8020bf1b>] system_call_after_swapgs+0x7b/0x80 [12844.067021] [12844.067021] [12844.067021] Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b 45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b 00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff [12844.067021] RIP [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.067021] RSP <ffff810064c5dc08> [12844.067021] CR2: ffffffffffffffff [12844.067021] ---[ end trace 02645136ff144df9 ]--- [12844.112513] BUG: unable to handle kernel paging request at ffffffffffffffff [12844.112521] IP: [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.112530] PGD 203067 PUD 204067 PMD 0 [12844.112533] Oops: 0000 [2] SMP DEBUG_PAGEALLOC [12844.112537] CPU 1 [12844.112539] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet rfkill_input snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter cpufreq_conservative ip6_tables x_tables cpufreq_ondemand cpufreq_userspace ipv6 cpufreq_powersave powernow_k8 freq_table fuse dm_crypt loop dm_mod arc4 ecb crypto_blkcipher b43 rfkill mac80211 cfg80211 led_class rfcomm input_polldev l2cap fan ssb thermal pcmcia joydev snd_hda_intel snd_pcm rtc_cmos yenta_socket usbhid rtc_core hci_usb processor rsrc_nonstatic snd_timer shpchp psmouse i2c_piix4 sdhci ohci1394 battery pcmcia_core snd_page_alloc snd_hwdep tifm_7xx1 pci_hotplug serio_raw ide_cd_mod ac button i2c_core backlight output ieee1394 tifm_core mmc_core rtc_lib ff_memless bluetooth snd soundcore firmware_class k8temp cdrom tg3 sg ohci_hcd ehci_hcd usbcore edd ext3 jbd atiixp ide_core [12844.112608] Pid: 13080, comm: kio_file Tainted: G M D 2.6.25 #401 [12844.112610] RIP: 0010:[<ffffffff802a7b3c>] 
[<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.112614] RSP: 0018:ffff81006409dc08 EFLAGS: 00010286 [12844.112617] RAX: ffffffffffffffff RBX: ffff8100f0bd7e10 RCX: 0000000000000012 [12844.112620] RDX: ffffffffffffffff RSI: ffff81006409dd08 RDI: ffff810053304320 [12844.112622] RBP: ffff81006409dc58 R08: 0000000000000003 R09: 0000000000000001 [12844.112625] R10: 0000000000000000 R11: 0000000000000246 R12: ffff810053304320 [12844.112627] R13: ffff81006409dd08 R14: 00000000c93d6a90 R15: 0000000000000019 [12844.112630] FS: 00007f08e0719700(0000) GS:ffff81007782d480(0000) knlGS:0000000000000000 [12844.112633] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [12844.112635] CR2: ffffffffffffffff CR3: 0000000064052000 CR4: 00000000000006a0 [12844.112638] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [12844.112640] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [12844.112643] Process kio_file (pid: 13080, threadinfo ffff81006409c000, task ffff81005d0788e0) [12844.112645] Stack: ffff810035bfd000 0000000000000019 ffff8100504a2000 0000000011c4a621 [12844.112651] ffffffffffffffff ffff8100769c41c0 ffff81006409de38 ffff810053305cc8 [12844.112655] 0000000000000000 ffff81006409de38 ffff81006409dca8 ffffffff8029e236 [12844.112659] Call Trace: [12844.112673] [<ffffffff8029e236>] do_lookup+0x2c/0x1b2 [12844.112683] [<ffffffff802a04b4>] __link_path_walk+0x8e6/0xdbd [12844.112707] [<ffffffffa004deb4>] ? :ext3:ext3_xattr_get_acl_default+0x18/0x1a [12844.112714] [<ffffffff802b0869>] ? generic_getxattr+0x4e/0x5c [12844.112726] [<ffffffff802a09ec>] path_walk+0x61/0xc3 [12844.112734] [<ffffffff802a0cd2>] do_path_lookup+0x15d/0x1d9 [12844.112744] [<ffffffff802a161a>] __user_walk_fd+0x41/0x5c [12844.112752] [<ffffffff8029a252>] vfs_lstat_fd+0x24/0x5a [12844.112759] [<ffffffff8030b30d>] ? _atomic_dec_and_lock+0x3d/0x5c [12844.112765] [<ffffffff802abe02>] ? mntput_no_expire+0x20/0x8b [12844.112771] [<ffffffff8029dfe8>] ? 
path_put+0x2c/0x30 [12844.112777] [<ffffffff802b128d>] ? sys_getxattr+0x60/0x75 [12844.112785] [<ffffffff8029a2aa>] sys_newlstat+0x22/0x3c [12844.112802] [<ffffffff8020bf1b>] system_call_after_swapgs+0x7b/0x80 [12844.112814] [12844.112815] [12844.112816] Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b 45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b 00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff [12844.112841] RIP [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.112844] RSP <ffff81006409dc08> [12844.112846] CR2: ffffffffffffffff [12844.112849] ---[ end trace 02645136ff144df9 ]--- [12877.045189] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [12877.882177] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:01:00:17:9a:f3:b5:75:08:00 SRC=62.121.83.254 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0xC0 TTL=1 ID=43194 PROTO=2 [12887.026920] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF OPT (94040000) PROTO=2 [12901.622330] Machine check events logged [12938.158203] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:01:00:17:9a:f3:b5:75:08:00 SRC=62.121.83.254 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0xC0 TTL=1 ID=45263 PROTO=2 [12939.885172] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF OPT (94040000) PROTO=2 [12959.627487] ACPI: Transitioning device [C352] to D0 [12959.627497] ACPI: Unable to turn cooling device [ffff810077859c80] 'on' [12998.650198] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:01:00:17:9a:f3:b5:75:08:00 SRC=62.121.83.254 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0xC0 TTL=1 ID=47341 PROTO=2 [13001.353794] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= 
MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF OPT (94040000) PROTO=2 [13005.057167] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [13017.810755] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [13025.971847] BUG: unable to handle kernel paging request at ffff81f0210de4c8 [13025.971855] IP: [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [13025.971865] PGD 0 [13025.971867] Oops: 0000 [3] SMP DEBUG_PAGEALLOC [13025.971871] CPU 1 [13025.971873] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet rfkill_input snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter cpufreq_conservative ip6_tables x_tables cpufreq_ondemand cpufreq_userspace ipv6 cpufreq_powersave powernow_k8 freq_table fuse dm_crypt loop dm_mod arc4 ecb crypto_blkcipher b43 rfkill mac80211 cfg80211 led_class rfcomm input_polldev l2cap fan ssb thermal pcmcia joydev snd_hda_intel snd_pcm rtc_cmos yenta_socket usbhid rtc_core hci_usb processor rsrc_nonstatic snd_timer shpchp psmouse i2c_piix4 sdhci ohci1394 battery pcmcia_core snd_page_alloc snd_hwdep tifm_7xx1 pci_hotplug serio_raw ide_cd_mod ac button i2c_core backlight output ieee1394 tifm_core mmc_core rtc_lib ff_memless bluetooth snd soundcore firmware_class k8temp cdrom tg3 sg ohci_hcd ehci_hcd usbcore edd ext3 jbd atiixp ide_core [13025.971941] Pid: 13061, comm: kmail Tainted: G M D 2.6.25 #401 [13025.971944] RIP: 0010:[<ffffffff802a7b3c>] [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [13025.971949] RSP: 0018:ffff8100501efd28 EFLAGS: 00010282 
[13025.971951] RAX: ffff81f0210de4c8 RBX: ffff81002032a898 RCX: 0000000000000012 [13025.971954] RDX: ffff81f0210de4c8 RSI: ffff8100501efe98 RDI: ffff810063823ed8 [13025.971956] RBP: ffff8100501efd78 R08: ffff8100501efe88 R09: 0000000000000000 [13025.971959] R10: ffff8100395fce50 R11: 0000000000000206 R12: ffff810063823ed8 [13025.971961] R13: ffff8100501efe98 R14: 0000000048eb1dd0 R15: 0000000000000011 [13025.971964] FS: 00007f59b8083700(0000) GS:ffff81007782d480(0000) knlGS:0000000000000000 [13025.971967] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [13025.971969] CR2: ffff81f0210de4c8 CR3: 0000000077066000 CR4: 00000000000006a0 [13025.971972] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [13025.971974] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [13025.971977] Process kmail (pid: 13061, threadinfo ffff8100501ee000, task ffff810055896000) [13025.971979] Stack: ffff8100501efe88 0000000000000011 ffff810064c5c01f ffffffff802a0cd2 [13025.971985] ffff81f0210de4c8 ffff8100501efe88 ffff8100501efe88 ffff810063823ed8 [13025.971989] ffff8100501efe98 ffff8100501efe88 ffff8100501efdb8 ffffffff8029e411 [13025.971993] Call Trace: [13025.972002] [<ffffffff802a0cd2>] ? do_path_lookup+0x15d/0x1d9 [13025.972011] [<ffffffff8029e411>] __lookup_hash+0x55/0x117 [13025.972019] [<ffffffff8029e50b>] lookup_hash+0x38/0x43 [13025.972025] [<ffffffff802a1bc5>] open_namei+0xf1/0x694 [13025.972030] [<ffffffff802a0cd2>] ? do_path_lookup+0x15d/0x1d9 [13025.972038] [<ffffffff8030b30d>] ? _atomic_dec_and_lock+0x3d/0x5c [13025.972049] [<ffffffff8029575d>] do_filp_open+0x28/0x4b [13025.972061] [<ffffffff80295488>] ? 
get_unused_fd_flags+0x80/0x114 [13025.972069] [<ffffffff802957d1>] do_sys_open+0x51/0xd2 [13025.972077] [<ffffffff8029587b>] sys_open+0x1b/0x1d [13025.972082] [<ffffffff8020bf1b>] system_call_after_swapgs+0x7b/0x80 [13025.972094] [13025.972095] [13025.972096] Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b 45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b 00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff [13025.972120] RIP [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [13025.972124] RSP <ffff8100501efd28> [13025.972126] CR2: ffff81f0210de4c8 [13025.972134] ---[ end trace 02645136ff144df9 ]--- [13064.772991] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF OPT (94040000) PROTO=2 [13101.953964] general protection fault: 0000 [4] SMP DEBUG_PAGEALLOC [13101.953971] CPU 1 [13101.953973] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet rfkill_input snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter cpufreq_conservative ip6_tables x_tables cpufreq_ondemand cpufreq_userspace ipv6 cpufreq_powersave powernow_k8 freq_table fuse dm_crypt loop dm_mod arc4 ecb crypto_blkcipher b43 rfkill mac80211 cfg80211 led_class rfcomm input_polldev l2cap fan ssb thermal pcmcia joydev snd_hda_intel snd_pcm rtc_cmos yenta_socket usbhid rtc_core hci_usb processor rsrc_nonstatic snd_timer shpchp psmouse i2c_piix4 sdhci ohci1394 battery pcmcia_core snd_page_alloc snd_hwdep tifm_7xx1 pci_hotplug serio_raw ide_cd_mod ac button i2c_core backlight output ieee1394 tifm_core mmc_core rtc_lib ff_memless bluetooth snd soundcore firmware_class k8temp cdrom tg3 sg ohci_hcd ehci_hcd usbcore edd ext3 jbd atiixp ide_core [13101.954037] Pid: 13254, comm: preload 
Tainted: G M D 2.6.25 #401 [13101.954040] RIP: 0010:[<ffffffff802a7b3c>] [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [13101.954049] RSP: 0018:ffff810034a85c08 EFLAGS: 00010282 [13101.954051] RAX: fff0810023444c98 RBX: ffff81002010bed8 RCX: 0000000000000012 [13101.954054] RDX: fff0810023444c98 RSI: ffff810034a85d08 RDI: ffff810071be34b0 [13101.954057] RBP: ffff810034a85c58 R08: ffff810034a85e38 R09: 3239312f726f6c6f [13101.954059] R10: 746e692f32393178 R11: 0000000000000246 R12: ffff810071be34b0 [13101.954062] R13: ffff810034a85d08 R14: 0000000045515a03 R15: 000000000000000e [13101.954065] FS: 00007f235d3b96f0(0000) GS:ffff81007782d480(0000) knlGS:0000000000000000 [13101.954067] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [13101.954070] CR2: 000000000084b000 CR3: 000000005d068000 CR4: 00000000000006a0 [13101.954072] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [13101.954075] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [13101.954078] Process preload (pid: 13254, threadinfo ffff810034a84000, task ffff8100597a31c0) [13101.954080] Stack: ffff810071be34b0 000000000000000e ffff810075dbf026 ffffffffa0045289 [13101.954085] fff0810023444c98 ffff8100769c41ed ffff810034a85e38 ffff810071be4cc8 [13101.954090] ffff8100769ce804 ffff810034a85e38 ffff810034a85ca8 ffffffff8029e236 [13101.954094] Call Trace: [13101.954116] [<ffffffffa0045289>] ? :ext3:ext3_lookup+0xa3/0xd0 [13101.954127] [<ffffffff8029e236>] do_lookup+0x2c/0x1b2 [13101.954137] [<ffffffff802a04b4>] __link_path_walk+0x8e6/0xdbd [13101.954148] [<ffffffff80273df4>] ? generic_file_aio_read+0x4eb/0x55c [13101.954161] [<ffffffff802a09ec>] path_walk+0x61/0xc3 [13101.954170] [<ffffffff802a0cd2>] do_path_lookup+0x15d/0x1d9 [13101.954180] [<ffffffff802a161a>] __user_walk_fd+0x41/0x5c [13101.954189] [<ffffffff8029a33d>] vfs_stat_fd+0x27/0x5d [13101.954199] [<ffffffff8022e6ba>] ? hrtick_set+0xdf/0xe8 [13101.954208] [<ffffffff80442a93>] ? 
thread_return+0x69/0xad
[13101.954219] [<ffffffff8029a3dc>] sys_newstat+0x22/0x3c
[13101.954225] [<ffffffff802976e3>] ? vfs_read+0x11f/0x134
[13101.954233] [<ffffffff80297a33>] ? sys_read+0x47/0x6f
[13101.954242] [<ffffffff8020bf1b>] system_call_after_swapgs+0x7b/0x80
[13101.954254]
[13101.954255]
[13101.954257] Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b 45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b 00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff
[13101.954282] RIP [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117
[13101.954286] RSP <ffff810034a85c08>
[13101.954295] ---[ end trace 02645136ff144df9 ]---
[13119.333166] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:01:00:17:9a:f3:b5:75:08:00 SRC=62.121.83.254 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0xC0 TTL=1 ID=51666 PROTO=2
[13120.837389] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF OPT (94040000) PROTO=2
[13179.516270] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:01:00:17:9a:f3:b5:75:08:00 SRC=62.121.83.254 DST=224.0.0.1 LEN=28 TOS=0x00 PREC=0xC0 TTL=1 ID=59251 PROTO=2
[13187.109038] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF OPT (94040000) PROTO=2

Moreover, I got a general protection fault in shrink_dcache_sb(), but I
hadn't been able to write down the exact address before it was wiped away
from the screen.

* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-20 19:04 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Rafael J. Wysocki @ 2008-04-20 19:14 ` Rafael J. Wysocki 2008-04-20 21:31 ` Linus Torvalds 2008-04-21 13:17 ` Ingo Molnar 2 siblings, 0 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-20 19:14 UTC (permalink / raw) To: LKML; +Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, linux-ext4 On Sunday, 20 of April 2008, Rafael J. Wysocki wrote: > Hi, > > I've just got the following traces from 2.6.25-git2 on HP nx6325 (64-bit). > I think they are related to the hang I described yesterday: > > [12844.066757] BUG: unable to handle kernel paging request at ffffffffffffffff > [12844.066765] IP: [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 > [12844.066775] PGD 203067 PUD 204067 PMD 0 > [12844.066778] Oops: 0000 [1] SMP DEBUG_PAGEALLOC > [12844.066782] CPU 1 > [12844.066784] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet rfkill_input snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter cpufreq_conservative ip6_tables x_tables cpufreq_ondemand cpufreq_userspace ipv6 cpufreq_powersave powernow_k8 freq_table fuse dm_crypt loop dm_mod arc4 ecb crypto_blkcipher b43 rfkill mac80211 cfg80211 led_class rfcomm input_polldev l2cap fan ssb thermal pcmcia joydev snd_hda_intel snd_pcm rtc_cmos yenta_socket usbhid rtc_core hci_usb processor rsrc_nonstatic snd_timer shpchp psmouse i2c_piix4 sdhci ohci1394 battery pcmcia_core snd_page_alloc snd_hwdep tifm_7xx1 pci_hotplug serio_raw ide_cd_mod ac button i2c_core backlight output ieee1394 tifm_core mmc_core rtc_lib ff_memless bluetooth snd soundcore firmware_class k8temp cdrom tg3 sg ohci_hcd ehci_hcd usbc > ore edd ext3 jbd atiixp ide_core > [12844.066854] 
Pid: 13078, comm: kio_file Tainted: G M 2.6.25 #401
> [12844.066857] RIP: 0010:[<ffffffff802a7b3c>] [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117

That is line 1250 in fs/dcache.c, btw:

(gdb) l *__d_lookup+0xf1
0xffffffff802a7b3c is in __d_lookup (/home/rafael/src/linux-2.6/fs/dcache.c:1250).
1245		struct hlist_node *node;
1246		struct dentry *dentry;
1247
1248		rcu_read_lock();
1249
1250		hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
1251			struct qstr *qstr;
1252
1253			if (dentry->d_name.hash != hash)
1254				continue;

* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-20 19:04 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Rafael J. Wysocki
  2008-04-20 19:14   ` Rafael J. Wysocki
@ 2008-04-20 21:31   ` Linus Torvalds
  2008-04-21  1:18     ` Herbert Xu
  2008-04-21 16:12     ` Rafael J. Wysocki
  2008-04-21 13:17   ` Ingo Molnar
  2 siblings, 2 replies; 183+ messages in thread
From: Linus Torvalds @ 2008-04-20 21:31 UTC
  To: Rafael J. Wysocki
  Cc: LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney

On Sun, 20 Apr 2008, Rafael J. Wysocki wrote:
>
> I've just got the following traces from 2.6.25-git2 on HP nx6325 (64-bit).
> I think they are related to the hang I described yesterday:
>
> [12844.066757] BUG: unable to handle kernel paging request at ffffffffffffffff

Something has added a dentry pointer that has the value -1 to the dentry
hash list. The access that oopses seems to be the prefetch(pos->next)
which is part of hlist_for_each_entry_rcu(), where "pos" is -1.

I suspect it's an RCU error, ie somebody has released a dentry entry, and
free'd it without waiting for the RCU grace period.

Talking about RCU I also think that whoever did those "rcu_dereference()"
macros in <linux/list.h> was insane. It's totally pointless to do
"rcu_dereference()" on a local variable. It simply *cannot* make sense.
Herbert, Paul, you guys should look at it.

As far as I can tell, rcu_dereference() should _always_ be done when we
access the "next" pointer (except for when prefetching, where we simply
don't care).

Paul? Herbert? Totally untested patch appended.

NOTE! I do not expect this patch to matter for this oops. There's
something else going on there.

		Linus

---
 include/linux/list.h | 34 +++++++++++++++++-----------------
 1 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 75ce2cb..4a851ba 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -631,14 +631,14 @@ static inline void list_splice_init_rcu(struct list_head *list,
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define list_for_each_rcu(pos, head) \
-	for (pos = (head)->next; \
-		prefetch(rcu_dereference(pos)->next), pos != (head); \
-		pos = pos->next)
+	for (pos = rcu_dereference((head)->next); \
+		prefetch(pos->next), pos != (head); \
+		pos = rcu_dereference(pos->next))
 
 #define __list_for_each_rcu(pos, head) \
-	for (pos = (head)->next; \
-		rcu_dereference(pos) != (head); \
-		pos = pos->next)
+	for (pos = rcu_dereference((head)->next); \
+		pos != (head); \
+		pos = rcu_dereference(pos->next))
 
 /**
  * list_for_each_safe_rcu
@@ -653,8 +653,8 @@ static inline void list_splice_init_rcu(struct list_head *list,
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define list_for_each_safe_rcu(pos, n, head) \
-	for (pos = (head)->next; \
-		n = rcu_dereference(pos)->next, pos != (head); \
+	for (pos = rcu_dereference((head)->next); \
+		n = rcu_dereference((pos)->next), pos != (head); \
 		pos = n)
 
 /**
@@ -668,10 +668,10 @@ static inline void list_splice_init_rcu(struct list_head *list,
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define list_for_each_entry_rcu(pos, head, member) \
-	for (pos = list_entry((head)->next, typeof(*pos), member); \
-		prefetch(rcu_dereference(pos)->member.next), \
+	for (pos = list_entry(rcu_dereference((head)->next), typeof(*pos), member); \
+		prefetch(pos->member.next), \
 			&pos->member != (head); \
-		pos = list_entry(pos->member.next, typeof(*pos), member))
+		pos = list_entry(rcu_dereference(pos->member.next), typeof(*pos), member))
 
 
 /**
@@ -686,9 +686,9 @@ static inline void list_splice_init_rcu(struct list_head *list,
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define list_for_each_continue_rcu(pos, head) \
-	for ((pos) = (pos)->next; \
-		prefetch(rcu_dereference((pos))->next), (pos) != (head); \
-		(pos) = (pos)->next)
+	for ((pos) = rcu_dereference((pos)->next); \
+		prefetch((pos)->next), (pos) != (head); \
+		(pos) = rcu_dereference((pos)->next))
 
 /*
  * Double linked lists with a single pointer list head.
@@ -986,10 +986,10 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define hlist_for_each_entry_rcu(tpos, pos, head, member) \
-	for (pos = (head)->first; \
-		rcu_dereference(pos) && ({ prefetch(pos->next); 1;}) && \
+	for (pos = rcu_dereference((head)->first); \
+		({ prefetch(pos->next); 1;}) && \
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
-		pos = pos->next)
+		pos = rcu_dereference(pos->next))
 
 #else
 #warning "don't include kernel headers in userspace"

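For readers who want to poke at the hlist traversal pattern discussed above outside the kernel, here is a minimal userspace sketch. Everything prefixed sketch_ is hypothetical and exists only for illustration: sketch_rcu_dereference() is merely a volatile load (it is not the kernel macro and provides no real ordering guarantee), struct sketch_entry is an invented stand-in for a hashed object, and __typeof__ is a GCC/Clang extension. The idea shown is the one argued for in the mail: apply the dereference primitive at the point where a pointer is loaded from shared memory, not to a local variable afterwards. Note that the sketch keeps an explicit NULL test on pos as the loop's termination condition; the hlist hunk of the untested patch as posted does not appear to contain such a test, so that part is an assumption about the intended behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins (assumptions, not the kernel definitions). */
struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

/* Stand-in for rcu_dereference(): one volatile load of the pointer.
 * The kernel macro additionally orders dependent loads where needed. */
#define sketch_rcu_dereference(p) (*(__typeof__(p) volatile *)&(p))

/* Recover the enclosing structure from a pointer to its hlist_node. */
#define sketch_hlist_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical element type, loosely modelled on a hashed object. */
struct sketch_entry {
	unsigned int hash;
	struct hlist_node node;
};

/* Walk one hash chain. Each pointer goes through the stand-in exactly
 * once, at the moment it is read from the shared structure. */
static struct sketch_entry *sketch_lookup(struct hlist_head *head,
					  unsigned int hash)
{
	struct hlist_node *pos;

	for (pos = sketch_rcu_dereference(head->first);
	     pos != NULL;	/* explicit end-of-chain test */
	     pos = sketch_rcu_dereference(pos->next)) {
		struct sketch_entry *e =
			sketch_hlist_entry(pos, struct sketch_entry, node);

		if (e->hash == hash)
			return e;
	}
	return NULL;
}

/* Build a three-entry chain and return the hash that a lookup finds,
 * or 0 if the lookup misses. */
static unsigned int sketch_demo(unsigned int wanted)
{
	struct sketch_entry a = { 0x11, { NULL } };
	struct sketch_entry b = { 0x22, { NULL } };
	struct sketch_entry c = { 0x33, { NULL } };
	struct hlist_head head = { &a.node };
	struct sketch_entry *hit;

	a.node.next = &b.node;
	b.node.next = &c.node;

	hit = sketch_lookup(&head, wanted);
	assert(hit == NULL || hit->hash == wanted);
	return hit ? hit->hash : 0;
}
```

Calling sketch_demo(0x22) walks the chain to the second entry, while sketch_demo(0x44) runs off the end of the chain and reports a miss. A chain whose next pointer had been overwritten with -1, as in the oops, would fault on the load of pos->next in the same way the kernel loop did.
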
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-20 21:31   ` Linus Torvalds
@ 2008-04-21  1:18     ` Herbert Xu
  2008-04-21  2:08       ` Paul E. McKenney
  2008-04-21 16:12     ` Rafael J. Wysocki
  1 sibling, 1 reply; 183+ messages in thread
From: Herbert Xu @ 2008-04-21 1:18 UTC
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4, Paul E. McKenney

Hi Linus:

On Sun, Apr 20, 2008 at 02:31:48PM -0700, Linus Torvalds wrote:
>
> Talking about RCU I also think that whoever did those "rcu_dereference()"
> macros in <linux/list.h> was insane. It's totally pointless to do
> "rcu_dereference()" on a local variable. It simply *cannot* make sense.
> Herbert, Paul, you guys should look at it.

Since I made the macros look this way I'm obliged to defend it :)

>  #define list_for_each_rcu(pos, head) \
> -	for (pos = (head)->next; \
> -		prefetch(rcu_dereference(pos)->next), pos != (head); \
> -		pos = pos->next)
> +	for (pos = rcu_dereference((head)->next); \
> +		prefetch(pos->next), pos != (head); \
> +		pos = rcu_dereference(pos->next))

Semantically there should be no difference between the two versions.
The purpose of rcu_dereference is really similar to smp_rmb, i.e.,
it adds a (conditional) read barrier between what has been read so
far (including its argument), and what will be read subsequently.

So if we expand out the current code it would look like

	fetch (head)->next
	store into pos
again:
	smp_read_barrier_depends()
	prefetch(pos->next)
	pos != (head)

	...loop body...

	fetch pos->next
	store into pos
	goto again

Yours looks like

	fetch (head)->next
	smp_read_barrier_depends()
	store into pos
again:
	prefetch(pos->next)
	pos != (head)

	...loop body...

	fetch pos->next
	smp_read_barrier_depends()
	store into pos
	goto again

As the objective here is to insert a barrier before dereferencing
pos (e.g., reading pos->next or using it in the loop body), these
two should be identical.

But I do concede that your version looks clearer, and has the
benefit that should prefetch ever be optimised out with no side-
effects, yours would still be correct while the current one will
lose the barrier completely.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

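The two expansions compared above can be mimicked in plain C to check the bookkeeping part of the argument. This is only a sketch under loud assumptions: sketch_rcu_dereference() below is a hypothetical stand-in that bumps a counter where the kernel would place smp_read_barrier_depends(), so it says nothing about real memory ordering or about what a compiler may legally refetch; it uses GCC/Clang statement expressions and __typeof__; and the loops are modelled on the old and proposed forms of __list_for_each_rcu() rather than copied from any kernel tree. What it does show is that both forms place one "barrier" between every load of a next pointer and the first use of that pointer.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Counts executions of the stand-in barrier (illustration only). */
static int sketch_barriers;

/* Hypothetical stand-in for rcu_dereference(): yield the pointer and
 * record that a dependency barrier would be issued at this point. */
#define sketch_rcu_dereference(p) \
	({ __typeof__(p) _v = (p); sketch_barriers++; _v; })

/* Old form: barrier applied to the local cursor in the loop test. */
#define sketch_for_each_old(pos, head) \
	for (pos = (head)->next; \
	     sketch_rcu_dereference(pos) != (head); \
	     pos = pos->next)

/* Proposed form: barrier applied where the pointer is loaded. */
#define sketch_for_each_new(pos, head) \
	for (pos = sketch_rcu_dereference((head)->next); \
	     pos != (head); \
	     pos = sketch_rcu_dereference(pos->next))

/* Build a circular list of n_nodes entries, walk it with one of the
 * two loop forms, and report nodes visited and barriers counted. */
static int sketch_run(int n_nodes, int use_new, int *barriers)
{
	struct list_head head, nodes[8];
	struct list_head *pos, *tail = &head;
	int i, visited = 0;

	assert(n_nodes >= 0 && n_nodes <= 8);
	for (i = 0; i < n_nodes; i++) {
		tail->next = &nodes[i];
		nodes[i].prev = tail;
		tail = &nodes[i];
	}
	tail->next = &head;	/* close the circle */
	head.prev = tail;

	sketch_barriers = 0;
	if (use_new) {
		sketch_for_each_new(pos, &head)
			visited++;
	} else {
		sketch_for_each_old(pos, &head)
			visited++;
	}
	*barriers = sketch_barriers;
	return visited;
}
```

For a three-node list both forms visit three nodes and count four barriers: one per pointer loaded, including the final load that turns out to be the list head. The difference between the two forms is therefore not the number of barriers but where they sit relative to the load, which is the readability and robustness point conceded at the end of the mail.
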
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21  1:18     ` Herbert Xu
@ 2008-04-21  2:08       ` Paul E. McKenney
  2008-04-21  4:59         ` Paul E. McKenney
  2008-04-21 15:49         ` Linus Torvalds
  0 siblings, 2 replies; 183+ messages in thread
From: Paul E. McKenney @ 2008-04-21 2:08 UTC
  To: Herbert Xu
  Cc: Linus Torvalds, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4

On Mon, Apr 21, 2008 at 09:18:55AM +0800, Herbert Xu wrote:
> Hi Linus:
>
> On Sun, Apr 20, 2008 at 02:31:48PM -0700, Linus Torvalds wrote:
> >
> > Talking about RCU I also think that whoever did those "rcu_dereference()"
> > macros in <linux/list.h> was insane. It's totally pointless to do
> > "rcu_dereference()" on a local variable. It simply *cannot* make sense.
> > Herbert, Paul, you guys should look at it.
>
> Since I made the macros look this way I'm obliged to defend it :)
>
> >  #define list_for_each_rcu(pos, head) \
> > -	for (pos = (head)->next; \
> > -		prefetch(rcu_dereference(pos)->next), pos != (head); \
> > -		pos = pos->next)
> > +	for (pos = rcu_dereference((head)->next); \
> > +		prefetch(pos->next), pos != (head); \
> > +		pos = rcu_dereference(pos->next))
>
> Semantically there should be no difference between the two versions.
> The purpose of rcu_dereference is really similar to smp_rmb, i.e.,
> it adds a (conditional) read barrier between what has been read so
> far (including its argument), and what will be read subsequently.
>
> So if we expand out the current code it would look like
>
> 	fetch (head)->next
> 	store into pos
> again:
> 	smp_read_barrier_depends()
> 	prefetch(pos->next)
> 	pos != (head)
>
> 	...loop body...
>
> 	fetch pos->next
> 	store into pos
> 	goto again
>
> Yours looks like
>
> 	fetch (head)->next
> 	smp_read_barrier_depends()
> 	store into pos
> again:
> 	prefetch(pos->next)
> 	pos != (head)
>
> 	...loop body...
>
> 	fetch pos->next
> 	smp_read_barrier_depends()
> 	store into pos
> 	goto again
>
> As the objective here is to insert a barrier before dereferencing
> pos (e.g., reading pos->next or using it in the loop body), these
> two should be identical.
>
> But I do concede that your version looks clearer, and has the
> benefit that should prefetch ever be optimised out with no side-
> effects, yours would still be correct while the current one will
> lose the barrier completely.

Agreed as well -- compilers would also be within their rights to bypass
the rcu_dereference() around the test/prefetch, which would allow them
to refetch. For example, with __list_for_each_rcu(), the original
implementation allows the compiler to treat a use of "pos" within the
body of the loop as if it was a use of (head)->next, refetching if
convenient. Not so good.

So good catch, Linus!!!

Could we also eliminate the (both unused in 2.6.25 and useless as well)
list_for_each_safe_rcu()? After all, if you use list_del_rcu() and
call_rcu(), all the RCU list-traversal primitives are "safe" in this
sense.

Patch attached (testing in progress), based on Linus's earlier patch.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 list.h | 47 +++++++++++++++--------------------------------
 1 file changed, 15 insertions(+), 32 deletions(-)

diff -urpNa linux-2.6.25/include/linux/list.h linux-2.6.25-rcu-list/include/linux/list.h
--- linux-2.6.25/include/linux/list.h	2008-04-16 19:49:44.000000000 -0700
+++ linux-2.6.25-rcu-list/include/linux/list.h	2008-04-20 18:44:55.000000000 -0700
@@ -631,31 +631,14 @@ static inline void list_splice_init_rcu(
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define list_for_each_rcu(pos, head) \
-	for (pos = (head)->next; \
-		prefetch(rcu_dereference(pos)->next), pos != (head); \
-		pos = pos->next)
+	for (pos = rcu_dereference((head)->next); \
+		prefetch(pos->next), pos != (head); \
+		pos = rcu_dereference(pos->next))
 
 #define __list_for_each_rcu(pos, head) \
-	for (pos = (head)->next; \
-		rcu_dereference(pos) != (head); \
-		pos = pos->next)
-
-/**
- * list_for_each_safe_rcu
- * @pos: the &struct list_head to use as a loop cursor.
- * @n: another &struct list_head to use as temporary storage
- * @head: the head for your list.
- *
- * Iterate over an rcu-protected list, safe against removal of list entry.
- *
- * This list-traversal primitive may safely run concurrently with
- * the _rcu list-mutation primitives such as list_add_rcu()
- * as long as the traversal is guarded by rcu_read_lock().
- */
-#define list_for_each_safe_rcu(pos, n, head) \
-	for (pos = (head)->next; \
-		n = rcu_dereference(pos)->next, pos != (head); \
-		pos = n)
+	for (pos = rcu_dereference((head)->next); \
+		pos != (head); \
+		pos = rcu_dereference(pos->next))
 
 /**
  * list_for_each_entry_rcu - iterate over rcu list of given type
@@ -668,10 +651,10 @@ static inline void list_splice_init_rcu(
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define list_for_each_entry_rcu(pos, head, member) \
-	for (pos = list_entry((head)->next, typeof(*pos), member); \
-		prefetch(rcu_dereference(pos)->member.next), \
+	for (pos = list_entry(rcu_dereference((head)->next), typeof(*pos), member); \
+		prefetch(pos->member.next), \
 			&pos->member != (head); \
-		pos = list_entry(pos->member.next, typeof(*pos), member))
+		pos = list_entry(rcu_dereference(pos->member.next), typeof(*pos), member))
 
 
 /**
@@ -686,9 +669,9 @@ static inline void list_splice_init_rcu(
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define list_for_each_continue_rcu(pos, head) \
-	for ((pos) = (pos)->next; \
-		prefetch(rcu_dereference((pos))->next), (pos) != (head); \
-		(pos) = (pos)->next)
+	for ((pos) = rcu_dereference((pos)->next); \
+		prefetch((pos)->next), (pos) != (head); \
+		(pos) = rcu_dereference((pos)->next))
 
 /*
  * Double linked lists with a single pointer list head.
@@ -986,10 +969,10 @@ static inline void hlist_add_after_rcu(s
  * as long as the traversal is guarded by rcu_read_lock().
  */
 #define hlist_for_each_entry_rcu(tpos, pos, head, member) \
-	for (pos = (head)->first; \
-		rcu_dereference(pos) && ({ prefetch(pos->next); 1;}) && \
+	for (pos = rcu_dereference((head)->first); \
+		({ prefetch(pos->next); 1;}) && \
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
-		pos = pos->next)
+		pos = rcu_dereference(pos->next))
 
 #else
 #warning "don't include kernel headers in userspace"

* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 2:08 ` Paul E. McKenney @ 2008-04-21 4:59 ` Paul E. McKenney 2008-04-21 5:47 ` Paul E. McKenney 2008-04-21 15:49 ` Linus Torvalds 1 sibling, 1 reply; 183+ messages in thread From: Paul E. McKenney @ 2008-04-21 4:59 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4 And here is an update with one bug fixed -- testing continues. This is an update of http://lkml.org/lkml/2008/4/20/217, deleting list_for_each_safe_rcu() and fixing hlist_for_each_entry_rcu(). Testing continues... Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- list.h | 48 +++++++++++++++--------------------------------- 1 file changed, 15 insertions(+), 33 deletions(-) diff -urpNa -X dontdiff linux-2.6.25/include/linux/list.h linux-2.6.25-rcu-list/include/linux/list.h --- linux-2.6.25/include/linux/list.h 2008-04-16 19:49:44.000000000 -0700 +++ linux-2.6.25-rcu-list/include/linux/list.h 2008-04-20 21:48:29.000000000 -0700 @@ -631,31 +631,14 @@ static inline void list_splice_init_rcu( * as long as the traversal is guarded by rcu_read_lock(). */ #define list_for_each_rcu(pos, head) \ - for (pos = (head)->next; \ - prefetch(rcu_dereference(pos)->next), pos != (head); \ - pos = pos->next) + for (pos = rcu_dereference((head)->next); \ + prefetch(pos->next), pos != (head); \ + pos = rcu_dereference(pos->next)) #define __list_for_each_rcu(pos, head) \ - for (pos = (head)->next; \ - rcu_dereference(pos) != (head); \ - pos = pos->next) - -/** - * list_for_each_safe_rcu - * @pos: the &struct list_head to use as a loop cursor. - * @n: another &struct list_head to use as temporary storage - * @head: the head for your list. - * - * Iterate over an rcu-protected list, safe against removal of list entry. 
- * - * This list-traversal primitive may safely run concurrently with - * the _rcu list-mutation primitives such as list_add_rcu() - * as long as the traversal is guarded by rcu_read_lock(). - */ -#define list_for_each_safe_rcu(pos, n, head) \ - for (pos = (head)->next; \ - n = rcu_dereference(pos)->next, pos != (head); \ - pos = n) + for (pos = rcu_dereference((head)->next); \ + pos != (head); \ + pos = rcu_dereference(pos->next)) /** * list_for_each_entry_rcu - iterate over rcu list of given type @@ -668,10 +651,9 @@ static inline void list_splice_init_rcu( * as long as the traversal is guarded by rcu_read_lock(). */ #define list_for_each_entry_rcu(pos, head, member) \ - for (pos = list_entry((head)->next, typeof(*pos), member); \ - prefetch(rcu_dereference(pos)->member.next), \ - &pos->member != (head); \ - pos = list_entry(pos->member.next, typeof(*pos), member)) + for (pos = list_entry(rcu_dereference((head)->next), typeof(*pos), member); \ + prefetch(pos->member.next), &pos->member != (head); \ + pos = list_entry(rcu_dereference(pos->member.next), typeof(*pos), member)) /** @@ -686,9 +668,9 @@ static inline void list_splice_init_rcu( * as long as the traversal is guarded by rcu_read_lock(). */ #define list_for_each_continue_rcu(pos, head) \ - for ((pos) = (pos)->next; \ - prefetch(rcu_dereference((pos))->next), (pos) != (head); \ - (pos) = (pos)->next) + for ((pos) = rcu_dereference((pos)->next); \ + prefetch((pos)->next), (pos) != (head); \ + (pos) = rcu_dereference((pos)->next)) /* * Double linked lists with a single pointer list head. @@ -986,10 +968,10 @@ static inline void hlist_add_after_rcu(s * as long as the traversal is guarded by rcu_read_lock(). 
*/ #define hlist_for_each_entry_rcu(tpos, pos, head, member) \ - for (pos = (head)->first; \ - rcu_dereference(pos) && ({ prefetch(pos->next); 1;}) && \ + for (pos = rcu_dereference((head)->first); \ + pos && ({ prefetch(pos->next); 1;}) && \ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ - pos = pos->next) + pos = rcu_dereference(pos->next)) #else #warning "don't include kernel headers in userspace" ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 4:59 ` Paul E. McKenney @ 2008-04-21 5:47 ` Paul E. McKenney 2008-04-21 13:00 ` Ingo Molnar 2008-04-21 16:06 ` Linus Torvalds 0 siblings, 2 replies; 183+ messages in thread From: Paul E. McKenney @ 2008-04-21 5:47 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4 On Sun, Apr 20, 2008 at 09:59:11PM -0700, Paul E. McKenney wrote: > And here is an update with one bug fixed -- testing continues. > This is an update of http://lkml.org/lkml/2008/4/20/217, deleting > list_for_each_safe_rcu() and fixing hlist_for_each_entry_rcu(). > Testing continues... And it passes. Thanx, Paul > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > --- > > list.h | 48 +++++++++++++++--------------------------------- > 1 file changed, 15 insertions(+), 33 deletions(-) > > diff -urpNa -X dontdiff linux-2.6.25/include/linux/list.h linux-2.6.25-rcu-list/include/linux/list.h > --- linux-2.6.25/include/linux/list.h 2008-04-16 19:49:44.000000000 -0700 > +++ linux-2.6.25-rcu-list/include/linux/list.h 2008-04-20 21:48:29.000000000 -0700 > @@ -631,31 +631,14 @@ static inline void list_splice_init_rcu( > * as long as the traversal is guarded by rcu_read_lock(). > */ > #define list_for_each_rcu(pos, head) \ > - for (pos = (head)->next; \ > - prefetch(rcu_dereference(pos)->next), pos != (head); \ > - pos = pos->next) > + for (pos = rcu_dereference((head)->next); \ > + prefetch(pos->next), pos != (head); \ > + pos = rcu_dereference(pos->next)) > > #define __list_for_each_rcu(pos, head) \ > - for (pos = (head)->next; \ > - rcu_dereference(pos) != (head); \ > - pos = pos->next) > - > -/** > - * list_for_each_safe_rcu > - * @pos: the &struct list_head to use as a loop cursor. > - * @n: another &struct list_head to use as temporary storage > - * @head: the head for your list. 
> - * > - * Iterate over an rcu-protected list, safe against removal of list entry. > - * > - * This list-traversal primitive may safely run concurrently with > - * the _rcu list-mutation primitives such as list_add_rcu() > - * as long as the traversal is guarded by rcu_read_lock(). > - */ > -#define list_for_each_safe_rcu(pos, n, head) \ > - for (pos = (head)->next; \ > - n = rcu_dereference(pos)->next, pos != (head); \ > - pos = n) > + for (pos = rcu_dereference((head)->next); \ > + pos != (head); \ > + pos = rcu_dereference(pos->next)) > > /** > * list_for_each_entry_rcu - iterate over rcu list of given type > @@ -668,10 +651,9 @@ static inline void list_splice_init_rcu( > * as long as the traversal is guarded by rcu_read_lock(). > */ > #define list_for_each_entry_rcu(pos, head, member) \ > - for (pos = list_entry((head)->next, typeof(*pos), member); \ > - prefetch(rcu_dereference(pos)->member.next), \ > - &pos->member != (head); \ > - pos = list_entry(pos->member.next, typeof(*pos), member)) > + for (pos = list_entry(rcu_dereference((head)->next), typeof(*pos), member); \ > + prefetch(pos->member.next), &pos->member != (head); \ > + pos = list_entry(rcu_dereference(pos->member.next), typeof(*pos), member)) > > > /** > @@ -686,9 +668,9 @@ static inline void list_splice_init_rcu( > * as long as the traversal is guarded by rcu_read_lock(). > */ > #define list_for_each_continue_rcu(pos, head) \ > - for ((pos) = (pos)->next; \ > - prefetch(rcu_dereference((pos))->next), (pos) != (head); \ > - (pos) = (pos)->next) > + for ((pos) = rcu_dereference((pos)->next); \ > + prefetch((pos)->next), (pos) != (head); \ > + (pos) = rcu_dereference((pos)->next)) > > /* > * Double linked lists with a single pointer list head. > @@ -986,10 +968,10 @@ static inline void hlist_add_after_rcu(s > * as long as the traversal is guarded by rcu_read_lock(). 
> */ > #define hlist_for_each_entry_rcu(tpos, pos, head, member) \ > - for (pos = (head)->first; \ > - rcu_dereference(pos) && ({ prefetch(pos->next); 1;}) && \ > + for (pos = rcu_dereference((head)->first); \ > + pos && ({ prefetch(pos->next); 1;}) && \ > ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ > - pos = pos->next) > + pos = rcu_dereference(pos->next)) > > #else > #warning "don't include kernel headers in userspace" ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 5:47 ` Paul E. McKenney @ 2008-04-21 13:00 ` Ingo Molnar 2008-04-21 16:06 ` Linus Torvalds 1 sibling, 0 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-21 13:00 UTC (permalink / raw) To: Paul E. McKenney Cc: Herbert Xu, Linus Torvalds, Rafael J. Wysocki, LKML, Andrew Morton, linux-ext4 * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Sun, Apr 20, 2008 at 09:59:11PM -0700, Paul E. McKenney wrote: > > And here is an update with one bug fixed -- testing continues. > > This is an update of http://lkml.org/lkml/2008/4/20/217, deleting > > list_for_each_safe_rcu() and fixing hlist_for_each_entry_rcu(). > > Testing continues... > > And it passes. i have queued up your patch in its form below. (but Linus might beat me at applying it) Ingo --------------------> Subject: RCU, list.h: fix list iterators From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Date: Sun, 20 Apr 2008 21:59:13 -0700 RCU list iterators: should prefetch ever be optimised out with no side-effects, the current version will lose the barrier completely. Pointed-out-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/list.h | 48 +++++++++++++++--------------------------------- 1 file changed, 15 insertions(+), 33 deletions(-) Index: linux/include/linux/list.h =================================================================== --- linux.orig/include/linux/list.h +++ linux/include/linux/list.h @@ -631,31 +631,14 @@ static inline void list_splice_init_rcu( * as long as the traversal is guarded by rcu_read_lock(). 
*/ #define list_for_each_rcu(pos, head) \ - for (pos = (head)->next; \ - prefetch(rcu_dereference(pos)->next), pos != (head); \ - pos = pos->next) + for (pos = rcu_dereference((head)->next); \ + prefetch(pos->next), pos != (head); \ + pos = rcu_dereference(pos->next)) #define __list_for_each_rcu(pos, head) \ - for (pos = (head)->next; \ - rcu_dereference(pos) != (head); \ - pos = pos->next) - -/** - * list_for_each_safe_rcu - * @pos: the &struct list_head to use as a loop cursor. - * @n: another &struct list_head to use as temporary storage - * @head: the head for your list. - * - * Iterate over an rcu-protected list, safe against removal of list entry. - * - * This list-traversal primitive may safely run concurrently with - * the _rcu list-mutation primitives such as list_add_rcu() - * as long as the traversal is guarded by rcu_read_lock(). - */ -#define list_for_each_safe_rcu(pos, n, head) \ - for (pos = (head)->next; \ - n = rcu_dereference(pos)->next, pos != (head); \ - pos = n) + for (pos = rcu_dereference((head)->next); \ + pos != (head); \ + pos = rcu_dereference(pos->next)) /** * list_for_each_entry_rcu - iterate over rcu list of given type @@ -668,10 +651,9 @@ static inline void list_splice_init_rcu( * as long as the traversal is guarded by rcu_read_lock(). */ #define list_for_each_entry_rcu(pos, head, member) \ - for (pos = list_entry((head)->next, typeof(*pos), member); \ - prefetch(rcu_dereference(pos)->member.next), \ - &pos->member != (head); \ - pos = list_entry(pos->member.next, typeof(*pos), member)) + for (pos = list_entry(rcu_dereference((head)->next), typeof(*pos), member); \ + prefetch(pos->member.next), &pos->member != (head); \ + pos = list_entry(rcu_dereference(pos->member.next), typeof(*pos), member)) /** @@ -686,9 +668,9 @@ static inline void list_splice_init_rcu( * as long as the traversal is guarded by rcu_read_lock(). 
*/ #define list_for_each_continue_rcu(pos, head) \ - for ((pos) = (pos)->next; \ - prefetch(rcu_dereference((pos))->next), (pos) != (head); \ - (pos) = (pos)->next) + for ((pos) = rcu_dereference((pos)->next); \ + prefetch((pos)->next), (pos) != (head); \ + (pos) = rcu_dereference((pos)->next)) /* * Double linked lists with a single pointer list head. @@ -986,10 +968,10 @@ static inline void hlist_add_after_rcu(s * as long as the traversal is guarded by rcu_read_lock(). */ #define hlist_for_each_entry_rcu(tpos, pos, head, member) \ - for (pos = (head)->first; \ - rcu_dereference(pos) && ({ prefetch(pos->next); 1;}) && \ + for (pos = rcu_dereference((head)->first); \ + pos && ({ prefetch(pos->next); 1;}) && \ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ - pos = pos->next) + pos = rcu_dereference(pos->next)) #else #warning "don't include kernel headers in userspace" ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 5:47 ` Paul E. McKenney 2008-04-21 13:00 ` Ingo Molnar @ 2008-04-21 16:06 ` Linus Torvalds 2008-04-21 16:24 ` Rafael J. Wysocki 1 sibling, 1 reply; 183+ messages in thread From: Linus Torvalds @ 2008-04-21 16:06 UTC (permalink / raw) To: Paul E. McKenney Cc: Herbert Xu, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4 On Sun, 20 Apr 2008, Paul E. McKenney wrote: > > And it passes. Ok, I applied it, with hopefully an understandable commit message. That said, now we just need to figure out what actually caused the bug in question. Rafael: if it's a too-early free of the dentry (which could be because somebody didn't do a proper rcu read-lock, or maybe the rcu grace period logic itself got broken?), then enabling SLUB/SLAB debugging should catch it much more quickly (and hopefully we'd see the signature of a use-after-free - the poisoning byte pattern rather than the -1). The other alternative is simply memory corruption. I.e. the -1 may well be somebody *else* overwriting the ->next pointer because they did a use-after-free and maybe the dentry_cache is shared with some other allocation of the same size (SLUB does that, no?) Rafael: your last oops does seem to imply that there is some strange memory corruption going on, because in that case the invalid pointer is different: instead of being all-ones, it is "fff0810023444c98", which is not a possible pointer. It very much looks like a single nybble got cleared (because ffff810023444c98 _would_ be a valid pointer, notice the "fff0" vs "ffff" prefix). So I do suspect it's *some* kind of use-after-free thing. But nothing in fs/ has changed, so it's not a dentry bug, I think. Which is why my "preferred" suspect is that "somebody else also does allocations of the same size as the dentry code, and shares the same SLUB alloc space, and does something bad". 
So Rafael - are you using SLUB, and if you are, can you enable SLUB_DEBUG, and then use the "slub_debug" kernel command line to enable it? Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
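The "single nybble got cleared" observation above can be checked mechanically. What follows is only a small user-space sketch, not anything from the kernel tree; the helper names bits_differing() and single_differing_nybble() are made up for illustration. It XORs the bad pointer from the oops against the pointer Linus says would have been valid and reports which 4-bit group differs.

```c
#include <stdint.h>

/* Count how many bit positions differ between two 64-bit values. */
static int bits_differing(uint64_t a, uint64_t b)
{
	uint64_t x = a ^ b;
	int n = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Return the index (0 = least significant) of the only nybble in which
 * a and b differ, or -1 if they differ in no nybble or in more than one.
 */
static int single_differing_nybble(uint64_t a, uint64_t b)
{
	uint64_t x = a ^ b;
	int idx = -1;
	int i;

	for (i = 0; i < 16; i++) {
		if ((x >> (4 * i)) & 0xf) {
			if (idx >= 0)
				return -1;	/* second differing nybble */
			idx = i;
		}
	}
	return idx;
}
```

Fed 0xfff0810023444c98 and 0xffff810023444c98, the two values differ only in nybble 12 (bits 48-51), and all four bits of that nybble differ, which is consistent with one nybble having been zeroed rather than with a random single-bit flip.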
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 16:06 ` Linus Torvalds @ 2008-04-21 16:24 ` Rafael J. Wysocki 0 siblings, 0 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-21 16:24 UTC (permalink / raw) To: Linus Torvalds Cc: Paul E. McKenney, Herbert Xu, LKML, Ingo Molnar, Andrew Morton, linux-ext4 On Monday, 21 of April 2008, Linus Torvalds wrote: > > On Sun, 20 Apr 2008, Paul E. McKenney wrote: > > > > And it passes. > > Ok, I applied it, with hopefully an understandable commit message. > > That said, now we just need to figure out what actually caused the bug in > question. > > Rafael: if it's a too-early free of the dentry (which could be because > somebody didn't do a proper rcu read-lock, or maybe the rcu grace period > logic itself got broken?), then enabling SLUB/SLAB debugging should catch > it much more quickly (and hopefully we'd see the signature of a > use-after-free - the poisoning byte pattern rather than the -1). > > The other alternative is simply memory corruption. Ie the -1 may well be > somebody *else* overwritin the ->next pointer because they did a > use-after-free and maybe the dentry_cache is shared with some other > allocation of the same size (SLUB does that, no?) > > Rafael: your last oops does seem to imply that there is some strange > memory corruption going on, because in that case the invalid pointer is > different: instead of being all-ones, it is "fff0810023444c98", which is > not a possible pointer. It very much looks like a single nybble got > cleared (because ffff810023444c98 _would_ be a valid pointer, notice the > "fff0" vs "ffff" prefix). > > So I do suspect it's *some* kind of use-after-free thing. But nothing in > fs/ has changed, so it's not a dentry bug, I think. Which is why my > "preferred" suspect is that "somebody else also does allocations of the > same size as the dentry code, and shares the same SLUB alloc space, and > does something bad". 
> > So Rafael - are you using SLUB, and if you are, can you enable SLUB_DEBUG, > and then use the "slub_debug" kernel command line to enable it? Sure, I have SLUB_DEBUG on already, rebooting with "slub_debug". Thanks, Rafael ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 2:08 ` Paul E. McKenney 2008-04-21 4:59 ` Paul E. McKenney @ 2008-04-21 15:49 ` Linus Torvalds 2008-04-21 17:05 ` Paul E. McKenney 2008-04-22 1:03 ` Herbert Xu 1 sibling, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-21 15:49 UTC (permalink / raw) To: Paul E. McKenney Cc: Herbert Xu, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4 On Sun, 20 Apr 2008, Paul E. McKenney wrote: > > > > But I do concede that your version looks clearer, and has the > > benefit that should prefetch ever be optimised out with no side- > > effects, yours would still be correct while the current one will > > lose the barrier completely. > > Agreed as well -- compilers would also be within their right to bypass > the rcu_dereference() around the test/prefetch, which would allow > them to refetch. That is *not* the main problem. If you use "rcu_dereference()" on the wrong access, it not only loses the "smp_read_barrier_depends()" (which is a no-op on all sane architectures anyway), but it loses the ACCESS_ONCE() thing *entirely*. Accessing a local automatic variable through a volatile pointer has absolutely no effect - it's a total no-op apart from possibly generating slightly worse code (although if I were a compiler, I'd just ignore it), since the compiler is totally free to spill and reload the local variable to its memory location - the stack - anyway! So the important part (for sane architectures) of rcu_dereference() is that ACCESS_ONCE() hack, and it _only_ works if you actually do it on the value as it gets loaded from the RCU-protected data structure, not later. So forget about the prefetch, and forget about the barrier. They had nothing to do with the bug. The bug existed even without the prefetch, even in the versions that didn't have it at all. 
For example, look at the "__list_for_each_rcu()" thing - the bug is there too, because it did just pos = (head)->next ... pos = pos->next where both of those assignments to pos were done without rcu_dereference, so the compiler could happily decide to use the value once, forget it, and then re-load it later (when it might have changed). In other words, the thing I objected to was something much more fundamental than any barriers. It was the fact that "rcu_dereference()" simply *fundamentally* doesn't make sense when done on a local variable, it can only make sense when actually loading the value from the data structure. In short: pos = .. rcu_dereference(pos) is crazy and senseless, but pos = rcu_dereference(pos->next) actually has some logical meaning. Now, all this said, I seriously doubt this was the source of the bug itself. I do not actually really believe that the compiler had much room for reloading things with or without any rcu_dereference(), and I doubt the code generation really changes all that much in practice. (In fact, from a quick look, it seems that the only thing that the incorrect use of "rcu_dereference()" did was to force the "node" variable onto the stack, since it did that volatile memory access through its pointer - and fixing the use of rcu_dereference() just means that "node" is kept in a register over the whole loop on x86-64, but the compiler still needs a stack slot, it just picks "str" instead. Which is a much better choice anyway.) So what the incorrect use of rcu_dereference() really resulted in was just this insane code (which is also seen in the BUG code): 14: 48 8b 45 d0 mov -0x30(%rbp),%rax 18: 48 8b 00 mov (%rax),%rax 1b: 48 89 45 d0 mov %rax,-0x30(%rbp) 1f: 48 8b 45 d0 mov -0x30(%rbp),%rax 23: 48 85 c0 test %rax,%rax 26: 74 18 je 0x40 and notice how insane that is, and how pointless? 
First it loads %rax (node) from the ->next pointer of the previous value of 'node' (which is a stack variable at -48(rbp)): # node = node->next mov -0x30(%rbp),%rax mov (%rax),%rax then it saves that to the stack and immediately reloads it (because of the volatile access on "pos" in "rcu_dereference(pos)"): # rcu_dereference(node) mov %rax,-0x30(%rbp) mov -0x30(%rbp),%rax and then it tests it for being NULL: test %rax,%rax je 0x40 and notice how the only thing that rcu_dereference() did was a totally unnecessary store and load to the stack? But also notice how gcc could have done the accesses to "node" *before* this entry as multiple loads from the original because there was nothing really holding this back. (But also notice how there really isn't much room for that in practice, since the code that actually uses "node->next" isn't going to do a whole lot of exciting stuff with it). With the corrected version, the insane "store and immediately reload from stack" goes away, and diffstat on the assembly language actually shows that there are two fewer instructions (most of the changes are just compiler labels moving around, but there are a few real changes that actually make the assembler code look a bit more natural too). So to recap: I don't think this mattered in practice. But the code was buggy in theory, even though in practice I don't think it would ever generate any reloads on that "next" variable simply because nothing else than the loop logic really used it. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
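The distinction drawn above, annotating the load from the shared structure rather than a later read of the local cursor, can be shown in a minimal user-space sketch. This is an illustration under stated assumptions, not the kernel's code: ACCESS_ONCE() below mirrors the volatile-cast idiom the message refers to (it relies on the GCC/Clang __typeof__ extension), rcu_dereference_sketch() is a made-up stand-in that omits smp_read_barrier_depends(), and struct node and sum_list() are hypothetical.

```c
#include <stddef.h>

/* Force a single load of x by going through a volatile-qualified lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/*
 * Hypothetical stand-in for rcu_dereference(): only the "load once" part.
 * The real macro also has smp_read_barrier_depends(), left out here.
 */
#define rcu_dereference_sketch(p) ACCESS_ONCE(p)

/* Circular singly linked list with a sentinel head, loosely like list_head. */
struct node {
	struct node *next;
	int val;
};

/*
 * Sum every element after the sentinel.  The annotation sits on the loads
 * of head->next and pos->next, i.e. where the value is fetched from the
 * shared structure.  Wrapping a later read of the local "pos" instead
 * would constrain nothing, because the compiler may spill and reload
 * a local at will.
 */
static int sum_list(struct node *head)
{
	struct node *pos;
	int sum = 0;

	for (pos = rcu_dereference_sketch(head->next);
	     pos != head;
	     pos = rcu_dereference_sketch(pos->next))
		sum += pos->val;
	return sum;
}
```

The loop has the same shape as the corrected __list_for_each_rcu() in the patch: initialise and advance through the annotated load, and compare the plain local against the head.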
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 15:49 ` Linus Torvalds @ 2008-04-21 17:05 ` Paul E. McKenney 2008-04-21 17:30 ` Linus Torvalds 2008-04-22 1:03 ` Herbert Xu 1 sibling, 1 reply; 183+ messages in thread From: Paul E. McKenney @ 2008-04-21 17:05 UTC (permalink / raw) To: Linus Torvalds Cc: Herbert Xu, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4 On Mon, Apr 21, 2008 at 08:49:58AM -0700, Linus Torvalds wrote: > > > On Sun, 20 Apr 2008, Paul E. McKenney wrote: > > > > > > But I do concede that your version looks clearer, and has the > > > benefit that should prefetch ever be optimised out with no side- > > > effects, yours would still be correct while the current one will > > > lose the barrier completely. > > > > Agreed as well -- compilers would also be within their right to bypass > > the rcu_dereference() around the test/prefetch, which would allow > > them to refetch. > > That is *not* the main problem. > > If you use "rcu_dereference()" on the wrong access, it not only loses the > "smp_read_barrier_depends()" (which is a no-op on all sane architectures > anyway), but it loses the ACCESS_ONCE() thing *entirely*. Yep, "compilers would also be within their right to bypass the rcu_dereference()", which as you say, has the ACCESS_ONCE(). > Accessign a local automatic variable through a volatile pointer has > absolutely no effect - it's a total no-op apart from possibly generating > slightly worse code (although if I were a compiler, I'd just ignore it), > since the compiler is totally free to spill and reload the local variable > to its memory location - the stack - anyway! Agreed. The only reasons I can think of for doing rcu_dereference() on a local variable are as follows: 1. The local variable is passed into a called function that is also invoked on shared storage. In this case, the use of rcu_dereference() on a local variable is the cost of common code. 2. 
The address of the local variable is published globally so that other CPUs can access it under RCU protection. Yes, this is generally insane -- the last time I did this sort of thing was in the early 1980s on a PDP-11, where it was necessary due to that machine's 64K address space. (No, I didn't use RCU on this UP machine, but I did publish locals -- malloc() choked badly in this case.) > So the important part (for sane architectures) of rcu_dereference() is > that ACCESS_ONCE() hack, and it _only_ works if you actually do it on the > value as it gets loaded from the RCU-protected data structure, not later. > > So forget about the prefetch, and forget about the barrier. They had > nothing to do with the bug. The bug existed even without the prefetch, > even in the versions that didn't have it at all. For example, look at the > "__list_for_each_rcu()" thing - the bug is there too, because it did just > > pos = (head)->next ... pos = pos->next > > where both of those assignments to pos were done without rcu_derefence, so > the compiler could happily decide to use the value once, forget it, and > then re-load it later (when it might have changed). > > In other words, the thing I objected to was something much more > fundamental than any barriers. It was the fact that "rcu_dereference()" > simply *fundamentally* doesn't make sense when done on a local variable, > it can only make sense when actually loading the value from the data > structure. > > In short: > > pos = .. > > rcu_dereference(pos) > > is crazy and senseless, but > > pos = rcu_dereference(pos->next) > > actually has some logical meaning. Agreed. > Now, all this said, I seriously doubt this was the source of the bug > itself. I do not actually really believe that the compiler had much room > for reloading things with or without any rcu_dereference(), and I doubt > the code generation really changes all that much in practice. 
Agreed -- you would have to have an uncommonly aggressive compiler to get this to happen. Seems like it would be worth trying the patch, though. I did take a quick look for improperly freeing dentries -- unhashed dentries are freed directly, so if there is a code path that somehow unhashes dentries and then d_free()s them without a grace period, we have a problem. Hmmmm... This could happen if someone called the final dput() on a dentry that was hashed. If this can really happen, the crude and untested patch below might help. > (In fact, from a quick look, it seems that the only thing that > the incorrect use of "rcu_derference()" did was to force the "node" > variable onto the stack, since it did that volatime memory access through > its pointer - and fixing the use of rcu_dereference() just means that > "node" is kept in a register over the whole loop on x86-64, but the > compiler still needs a stack slot, it just picks "str" instead. Which is a > much better choice anyway. > > So what the incorrect use of rcu_dereference() really resulted in was just > this insane code (which is also seen in the BUG code): > > 14: 48 8b 45 d0 mov -0x30(%rbp),%rax > 18: 48 8b 00 mov (%rax),%rax > 1b: 48 89 45 d0 mov %rax,-0x30(%rbp) > 1f: 48 8b 45 d0 mov -0x30(%rbp),%rax > 23: 48 85 c0 test %rax,%rax > 26: 74 18 je 0x40 > > and notice how insane that is, and how pointless? Yep. Ugly as sin. > First it loads %rax (node) from the ->next pointer of the previous value > of 'node' (which is a stack variable at -48(rbp)): > > # node = node->next > mov -0x30(%rbp),%rax > mov (%rax),%rax > > then it saves that to the stack and immediately reloads it (because of the > volatile access on "pos" in "rcu_dereference(pos)"): > > # rcu_dereference(node) > mov %rax,-0x30(%rbp) > mov -0x30(%rbp),%rax > > and then it tests it for being NULL: > > test %rax,%rax > je 0x40 > > and notice how the only thing that rcu_dereference() did was a totally > unnecessary store and load to the stack? 
But also notice how gcc could > have done the accesses to "node" *before* this entry as multiple loads > from the original because there was nothing really holding this back. > > (But also notice how there really isn't much room for that in practice, > since the code that actually uses "node->next" isn't going to do a whole > lot of exciting stuff with it). > > With the corrected version, the insane "store and immediately reload from > stack" goes away" and diffstat on the assembly language actually shows > that there are two less instructions (most of the changes are just > compiler labels moving around, but there are a few real changes that > actually makes the assembler code look a bit more natural too). > > So to recap: I don't think this mattered in practice. But the code was > buggy in theory, even though in practice I don't think it would ever > generate any reloads on that "next" variable simply because nothing else > than the loop logic really used it. Rafael, does the following (crude, untested, probably does not even compile) patch help? Thanx, Paul Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- dcache.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff -urpNa -X dontdiff linux-2.6.25/fs/dcache.c linux-2.6.25-d_free/fs/dcache.c --- linux-2.6.25/fs/dcache.c 2008-04-16 19:49:44.000000000 -0700 +++ linux-2.6.25-d_free/fs/dcache.c 2008-04-21 09:57:53.000000000 -0700 @@ -88,11 +88,7 @@ static void d_free(struct dentry *dentry { if (dentry->d_op && dentry->d_op->d_release) dentry->d_op->d_release(dentry); - /* if dentry was never inserted into hash, immediate free is OK */ - if (hlist_unhashed(&dentry->d_hash)) - __d_free(dentry); - else - call_rcu(&dentry->d_u.d_rcu, d_callback); + call_rcu(&dentry->d_u.d_rcu, d_callback); } static void dentry_lru_remove(struct dentry *dentry) ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 17:05 ` Paul E. McKenney
@ 2008-04-21 17:30 ` Linus Torvalds
  2008-04-21 17:43 ` Paul E. McKenney
  0 siblings, 1 reply; 183+ messages in thread
From: Linus Torvalds @ 2008-04-21 17:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Herbert Xu, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4

On Mon, 21 Apr 2008, Paul E. McKenney wrote:
>
> I did take a quick look for improperly freeing dentries -- unhashed
> dentries are freed directly, so if there is a code path that somehow
> unhashes dentries and then d_free()s them without a grace period, we
> have a problem.

No, not even then.

We *always* unhash the dentries before freeing them, but we very
consciously use "hlist_del_rcu()" on them, not "hlist_del_init()".

That, in turn, will mean that the "pprev" pointer will still be set, so
the "hlist_unhashed()" thing will *not* trigger.

IOW, when we do that direct-free with:

	if (hlist_unhashed(&dentry->d_hash))
		__d_free(dentry);

the "hlist_unhashed()" will literally guarantee that it has *never* been on
a hash-list at all!

(If you want to test whether it is currently unhashed or not, you actually
have to use "d_unhashed()" on the dentry under the dentry lock, which
tests the DCACHE_UNHASHED bit).

Of course, there could be some bug in there, but the thing is, none of
this has even changed in a long time, certainly not since 2.6.25. Which is
why I think the dcache code is all fine, and the bug comes from somewhere
else corrupting the data structures.

Linus
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 17:30 ` Linus Torvalds
@ 2008-04-21 17:43 ` Paul E. McKenney
  0 siblings, 0 replies; 183+ messages in thread
From: Paul E. McKenney @ 2008-04-21 17:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Herbert Xu, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4

On Mon, Apr 21, 2008 at 10:30:19AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 21 Apr 2008, Paul E. McKenney wrote:
> >
> > I did take a quick look for improperly freeing dentries -- unhashed
> > dentries are freed directly, so if there is a code path that somehow
> > unhashes dentries and then d_free()s them without a grace period, we
> > have a problem.
>
> No, not even then.
>
> We *always* unhash the dentries before freeing them, but we very
> consciously use "hlist_del_rcu()" on them, not "hlist_del_init()".
>
> That, in turn, will mean that the "pprev" pointer will still be set, so
> the "hlist_unhashed()" thing will *not* trigger.
>
> IOW, when we do that direct-free with:
>
> 	if (hlist_unhashed(&dentry->d_hash))
> 		__d_free(dentry);
>
> the "hlist_unhashed()" will literally guarantee that it has *never* been on
> a hash-list at all!

Got it, hlist_del_rcu() sets ->pprev to LIST_POISON2, which is non-NULL,
so the dentry still gets to wait for a grace period. Color me blind!!!

> (If you want to test whether it is currently unhashed or not, you actually
> have to use "d_unhashed()" on the dentry under the dentry lock, which
> tests the DCACHE_UNHASHED bit).

And as it looks like you guessed, I was misreading the hlist_unhashed()
above as d_unhashed(). :-/

Thanx, Paul

> Of course, there could be some bug in there, but the thing is, none of
> this has even changed in a long time, certainly not since 2.6.25. Which is
> why I think the dcache code is all fine, and the bug comes from somewhere
> else corrupting the data structures.
>
> Linus
^ permalink raw reply [flat|nested] 183+ messages in thread
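[Editorial sketch, not part of the thread] The point settled in the exchange above (a node removed with hlist_del_rcu() keeps a non-NULL ->pprev, so hlist_unhashed() only reports true for a node that was never hashed or was removed with hlist_del_init()) can be illustrated with a small user-space model. This is a simplified sketch, not the kernel's <linux/list.h>: the struct layout and helper names mirror the kernel's, but MODEL_POISON2, enum removal, unhashed_after() and self_check() are hypothetical names introduced only for this illustration, and no RCU or memory-ordering behaviour is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of the kernel's hlist types; not the kernel headers. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Hypothetical stand-in for LIST_POISON2: any non-NULL marker will do here. */
#define MODEL_POISON2 ((struct hlist_node **)0x200200)

/* "Unhashed" in this sense only means that ->pprev is NULL. */
static int hlist_unhashed(const struct hlist_node *n)
{
	return !n->pprev;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

static void model_unlink(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	*pprev = next;
	if (next)
		next->pprev = pprev;
}

/* RCU-style removal: ->next is left alone, ->pprev is poisoned (non-NULL). */
static void hlist_del_rcu(struct hlist_node *n)
{
	model_unlink(n);
	n->pprev = MODEL_POISON2;
}

/* Non-RCU removal: the node is reset to its initial, unhashed state. */
static void hlist_del_init(struct hlist_node *n)
{
	if (!hlist_unhashed(n)) {
		model_unlink(n);
		n->next = NULL;
		n->pprev = NULL;
	}
}

enum removal { NEVER_HASHED, DEL_RCU, DEL_INIT };

/* What hlist_unhashed() reports for a node with the given history. */
static int unhashed_after(enum removal how)
{
	struct hlist_head head = { NULL };
	struct hlist_node node = { NULL, NULL };

	if (how == NEVER_HASHED)
		return hlist_unhashed(&node);
	hlist_add_head(&node, &head);
	if (how == DEL_RCU)
		hlist_del_rcu(&node);
	else
		hlist_del_init(&node);
	return hlist_unhashed(&node);
}

static void self_check(void)
{
	assert(unhashed_after(NEVER_HASHED)); /* direct free would be taken */
	assert(!unhashed_after(DEL_RCU));     /* still goes through call_rcu() */
	assert(unhashed_after(DEL_INIT));     /* the case Paul was worried about */
}
```

In this model only the DEL_INIT history would reach the direct-free branch after having been on a hash chain, which matches the statement in the thread that the dcache uses hlist_del_rcu() rather than hlist_del_init() for exactly this reason.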
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 15:49 ` Linus Torvalds
  2008-04-21 17:05 ` Paul E. McKenney
@ 2008-04-22 1:03 ` Herbert Xu
  2008-04-22 13:36 ` Paul E. McKenney
  1 sibling, 1 reply; 183+ messages in thread
From: Herbert Xu @ 2008-04-22 1:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4

On Mon, Apr 21, 2008 at 08:49:58AM -0700, Linus Torvalds wrote:
>
> That is *not* the main problem.
>
> If you use "rcu_dereference()" on the wrong access, it not only loses the
> "smp_read_barrier_depends()" (which is a no-op on all sane architectures
> anyway), but it loses the ACCESS_ONCE() thing *entirely*.

Actually rcu_dereference didn't have ACCESS_ONCE when I did this.
That only appeared later with the preemptible RCU work.

The original purpose of rcu_dereference was exactly to replace the
explicit barriers that people were using for RCU, nothing more,
nothing less.

Oh and I totally agree that the compiler is going to generate insane
code whenever ACCESS_ONCE is used. In this case we may have avoided
it by rearranging the code, but in general the introduction of ACCESS_ONCE
in rcu_dereference is likely to have a negative impact on the code
generated.

Remember that "volatile" discussion? I think this is where it all came
from.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-22 1:03 ` Herbert Xu
@ 2008-04-22 13:36 ` Paul E. McKenney
  0 siblings, 0 replies; 183+ messages in thread
From: Paul E. McKenney @ 2008-04-22 13:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linus Torvalds, Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4

On Tue, Apr 22, 2008 at 09:03:04AM +0800, Herbert Xu wrote:
> On Mon, Apr 21, 2008 at 08:49:58AM -0700, Linus Torvalds wrote:
> >
> > That is *not* the main problem.
> >
> > If you use "rcu_dereference()" on the wrong access, it not only loses the
> > "smp_read_barrier_depends()" (which is a no-op on all sane architectures
> > anyway), but it loses the ACCESS_ONCE() thing *entirely*.
>
> Actually rcu_dereference didn't have ACCESS_ONCE when I did this.
> That only appeared later with the preemptible RCU work.

Yep, ACCESS_ONCE() is quite recent -- within the last year. So I should
have modified the list_for_each.*rcu() macros when I made that change.

> The original purpose of rcu_dereference was exactly to replace the
> explicit barriers that people were using for RCU, nothing more,
> nothing less.
>
> Oh and I totally agree that the compiler is going to generate insane
> code whenever ACCESS_ONCE is used. In this case we may have avoided
> it by rearranging the code, but in general the introduction of ACCESS_ONCE
> in rcu_dereference is likely to have a negative impact on the code
> generated.
>
> Remember that "volatile" discussion? I think this is where it all came
> from.

And I still have the bug in to gcc:

	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33102

Interesting, currently in status "unconfirmed"... I guess I should
supply a test case.

Thanx, Paul
^ permalink raw reply [flat|nested] 183+ messages in thread
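[Editorial sketch, not part of the thread] The difference being discussed, a volatile "access once" applied to the local cursor versus applied to the load of the ->next pointer, can be shown with a user-space model. This is a sketch under stated assumptions: ACCESS_ONCE() is written here in the volatile-cast form and needs the GNU C __typeof__ extension; struct node, count_local_once(), count_load_once(), demo_count() and self_check() are hypothetical names for illustration only; the smp_read_barrier_depends() half of rcu_dereference() is not modelled, and the real list macros are more involved than these loops.

```c
#include <assert.h>
#include <stddef.h>

/* Volatile-cast form of "access once"; relies on GNU C __typeof__. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct node {
	struct node *next;
	int key;
};

/*
 * Shape criticised in the thread: the volatile access is applied to the
 * local cursor, so the loads of *headp and pos->next are ordinary loads
 * and the cursor itself is forced through memory on every test.
 */
static int count_local_once(struct node **headp)
{
	struct node *pos;
	int n = 0;

	for (pos = *headp; ACCESS_ONCE(pos); pos = pos->next)
		n++;
	return n;
}

/*
 * Corrected shape: the volatile access is on the shared pointers that
 * another CPU may change, and the cursor is a plain local variable.
 */
static int count_load_once(struct node **headp)
{
	struct node *pos;
	int n = 0;

	for (pos = ACCESS_ONCE(*headp); pos; pos = ACCESS_ONCE(pos->next))
		n++;
	return n;
}

/* Build a three-element list on the stack and walk it with either shape. */
static int demo_count(int use_corrected_shape)
{
	struct node c = { NULL, 3 };
	struct node b = { &c, 2 };
	struct node a = { &b, 1 };
	struct node *first = &a;

	return use_corrected_shape ? count_load_once(&first)
				   : count_local_once(&first);
}

static void self_check(void)
{
	/* Single-threaded, both shapes visit the same nodes. */
	assert(demo_count(0) == 3);
	assert(demo_count(1) == 3);
}
```

Both walkers return the same count when run single-threaded; the difference is only in which memory accesses the compiler is forbidden from duplicating or eliding, which is the theoretical bug (and the store/reload pattern in the generated code) described earlier in the thread.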
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-20 21:31 ` Linus Torvalds 2008-04-21 1:18 ` Herbert Xu @ 2008-04-21 16:12 ` Rafael J. Wysocki 2008-04-21 16:54 ` Linus Torvalds 1 sibling, 1 reply; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-21 16:12 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney On Sunday, 20 of April 2008, Linus Torvalds wrote: > > On Sun, 20 Apr 2008, Rafael J. Wysocki wrote: > > > > I've just got the following traces from 2.6.25-git2 on HP nx6325 (64-bit). > > I think they are related to the hang I described yesterday: > > > > [12844.066757] BUG: unable to handle kernel paging request at ffffffffffffffff > > Something has added a dentry pointer that has the value -1 to the dentry > hash list. The access that oopses seems to be the > > prefetch(pos->next) > > which is part of hlist_for_each_entry_rcu(), where "pos" is -1. > > I suspect it's an RCU error, ie somebody has released a dentry entry, and > free'd it without waiting for the RCU grace period. > > Talking about RCU I also think that whoever did those "rcu_dereference()" > macros in <linux/list.h> was insane. It's totally pointless to do > "rcu_dereference()" on a local variable. It simply *cannot* make sense. > Herbert, Paul, you guys should look at it. > > As far as I can tell, rcu_dereference() should _always_ be done when we > access the "next" pointer (except for when prefetching, where we simply > don't care). > > Paul? Herbert? Totally untested patch appended. > > NOTE! I do not expect this patch to matter for this oops. There's > something else going on there. Well, it seems that the oops is actually known from -mm: http://lkml.org/lkml/2008/4/21/55 and something similar was observed with 2.6.25-rc8-mm2. Thanks, Rafael ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 16:12 ` Rafael J. Wysocki @ 2008-04-21 16:54 ` Linus Torvalds 2008-04-21 17:06 ` Jiri Slaby 2008-04-21 20:39 ` David Miller 0 siblings, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-21 16:54 UTC (permalink / raw) To: Rafael J. Wysocki Cc: LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney, Jiri Slaby, David S. Miller On Mon, 21 Apr 2008, Rafael J. Wysocki wrote: > > Well, it seems that the oops is actually known from -mm: > > http://lkml.org/lkml/2008/4/21/55 > > and something similar was observed with 2.6.25-rc8-mm2. Hmm. Sadly, I doubt that really cuts down the suspect list very much. Most of what has been merged since 2.6.25 has been in -mm, so while I agree that it looks very similar, the fact that it was possibly already in -rc8-mm2 doesn't much _help_. And in fact, those oopses in rc8-mm2 don't look _that_ similar. Those are a corrupt f_mapping structure, it looks like (ie it looks like either "struct address_space" or a "struct filp" rather than a "struct dentry"). What is interesting about Jiri's version of the bug is that he has another value for the corruption than you do: you had either all-ones, or a value that *looked* like possibly a single nybble got cleared. Jiri, in contrast, has a value of 00f0000000000000. Which is a bit interesting in that it's again a *nybble* that looks corrupt, but it's a different one. But assuming Jiri's two oopses are related (which is not entirely unlikely), and assuming that this is a SLUB bucket re-use, then it's quite likely that the reason that his -rc8-mm2 oops looks different just because it was yet _another_ allocation that was in the same bucket. If so, the most likely one is "struct filp", because it has the right size: for me a filp is in the 192-byte bucket, which is very close to the 208-byte bucket of dentry. 
So I could imagine that some config option or other change just changed
the sizes around so that the two types ended up in different buckets in
rc8-mm2 and in 2.6.25-mm1 (ie neither the dentry nor the filp necessarily
changed sizes, but the *corrupting* type perhaps did?)

What I find interesting is that at least for me, I have the SLAB bucket
size for nf_conntrack_expect being 208 bytes. And the *biggest* merge by
far after 2.6.25 so far has been networking (and conntrack in particular).

Is that a smoking gun? Not necessarily. But it *is* intriguing. But there
are other possible clashes (the 192-byte bucket has several different
suspects, and not all of them are in networking).

Jiri and Davem added to the Cc.

Jiri - could you also confirm whether you are using SLUB (which is not
necessarily at all indicative of a SLUB bug itself - it's just that SLAB
won't ever even merge different allocations of the same size into the same
buckets, so if it's a cross-slab corruption, you'd simply never see it
with SLAB).

And if you are, can you please enable SLUB_DEBUG, and add a "slub_debug"
to your kernel command line to enable all the debugging? That would
hopefully catch any obvious use-after-free corruption.

I'm just whistling in the dark here, but it does seem worth pursuing this
approach. The VFS layer has not changed *at*all* since 2.6.25, so I
seriously doubt it's a dentry or filp bug - I think the corruption is
external. And while networking is certainly not the only suspect (the x86
architecture changes are pretty extensive too), the allocation size thing
certainly makes it intriguing.

Linus
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 16:54 ` Linus Torvalds
@ 2008-04-21 17:06 ` Jiri Slaby
  2008-04-21 17:19 ` Rafael J. Wysocki
  2008-04-21 17:48 ` Linus Torvalds
  2008-04-21 20:39 ` David Miller
  1 sibling, 2 replies; 183+ messages in thread
From: Jiri Slaby @ 2008-04-21 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney, David S. Miller

On 04/21/2008 06:54 PM, Linus Torvalds wrote:
> Jiri - could you also confirm whether you are using SLUB (which is not
> necessarily at all indicative of a SLUB bug itself - it's just that SLAB
> won't ever even merge different allocations of the same size into the same
> buckets, so if it's a cross-slab corruption, you'd simply never see it
> with SLAB).

Yeah, I'm using slub. Going to boot with slub_debug.

Thanks so far.

BTW, I haven't seen this without a suspend/resume cycle, have you, Rafael?
It doesn't mean anything, since it needs a longer time to trigger, but anyway,
it might be a clue.
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 17:06 ` Jiri Slaby
@ 2008-04-21 17:19 ` Rafael J. Wysocki
  2008-04-21 17:48 ` Linus Torvalds
  1 sibling, 0 replies; 183+ messages in thread
From: Rafael J. Wysocki @ 2008-04-21 17:19 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Linus Torvalds, LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney, David S. Miller

On Monday, 21 of April 2008, Jiri Slaby wrote:
> On 04/21/2008 06:54 PM, Linus Torvalds wrote:
> > Jiri - could you also confirm whether you are using SLUB (which is not
> > necessarily at all indicative of a SLUB bug itself - it's just that SLAB
> > won't ever even merge different allocations of the same size into the same
> > buckets, so if it's a cross-slab corruption, you'd simply never see it
> > with SLAB).
>
> Yeah, I'm using slub. Going to boot with slub_debug.
>
> Thanks so far.
>
> BTW, I haven't seen this without a suspend/resume cycle, have you, Rafael?

Well, I've seen it only once so far. :-)

> It doesn't mean anything, since it needs a longer time to trigger, but anyway,
> it might be a clue.

I think we need some more data anyway.

Thanks,
Rafael
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 17:06 ` Jiri Slaby 2008-04-21 17:19 ` Rafael J. Wysocki @ 2008-04-21 17:48 ` Linus Torvalds 2008-04-21 18:22 ` Rafael J. Wysocki 2008-04-21 19:38 ` Jiri Slaby 1 sibling, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-21 17:48 UTC (permalink / raw) To: Jiri Slaby Cc: Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney, David S. Miller On Mon, 21 Apr 2008, Jiri Slaby wrote: > > BTW. I haven't see this without suspend/resume cycle, do you, Rafael? It > doesn't mean anything, since it needs longer time to trigger, but anyway, it > might be a clue. There's a separate (and very different-looking) bug-report about the atl1 driver having problems when doing an "ifconfig down" on it. In fact, the problem report says: > With this commit in tree, I can reproduce either > a) kmalloc-2048 corruption after initscripts shutdown eth0 > http://marc.info/?l=linux-kernel&m=120820360221261&w=2 > > b) or oopses at filp_close() first reported long ago > (sorry, can't find that email) where that "or oopses at filp_close()" thing is somewhat interesting, since your original bug was about something that looked like file pointer corruption. Now, I doubt you have an ATL chip, and I doubt the two are _really_ related in any way (the ATL bug was actually triggered by enabling 64-bit DMA), but the filp_close thing makes me go "hmm". The two affected corrupted SLUB areas were the 2kB allocation (1560-byte ethernet packets plus skb_shared_info overhead, anyone?) and apparently the one that filp's are in (perhaps a 20-byte TCP ACK packet or other "small" packet + the skb_shared_info overhead would be a common case that might be in that 200-byte range?) Maybe the ATL bug isn't ATL-specific at all, but somehow connected to NETIF_F_HIGHDMA. Do you have 4GB+ of RAM? 
And one thing that suspend/resume does, which is not necessarily commonly done during normal operation, is that ifconfig down/up pattern. Maybe there is something broken in general there? Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 17:48 ` Linus Torvalds @ 2008-04-21 18:22 ` Rafael J. Wysocki 2008-04-21 19:38 ` Jiri Slaby 1 sibling, 0 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-21 18:22 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney, David S. Miller On Monday, 21 of April 2008, Linus Torvalds wrote: > > On Mon, 21 Apr 2008, Jiri Slaby wrote: > > > > BTW. I haven't see this without suspend/resume cycle, do you, Rafael? It > > doesn't mean anything, since it needs longer time to trigger, but anyway, it > > might be a clue. > > There's a separate (and very different-looking) bug-report about the atl1 > driver having problems when doing an "ifconfig down" on it. In fact, the > problem report says: > > > With this commit in tree, I can reproduce either > > a) kmalloc-2048 corruption after initscripts shutdown eth0 > > http://marc.info/?l=linux-kernel&m=120820360221261&w=2 > > > > b) or oopses at filp_close() first reported long ago > > (sorry, can't find that email) > > where that "or oopses at filp_close()" thing is somewhat interesting, > since your original bug was about something that looked like file pointer > corruption. > > Now, I doubt you have an ATL chip, and I doubt the two are _really_ > related in any way (the ATL bug was actually triggered by enabling 64-bit > DMA), but the filp_close thing makes me go "hmm". > > The two affected corrupted SLUB areas were the 2kB allocation (1560-byte > ethernet packets plus skb_shared_info overhead, anyone?) and apparently > the one that filp's are in (perhaps a 20-byte TCP ACK packet or other > "small" packet + the skb_shared_info overhead would be a common case that > might be in that 200-byte range?) > > Maybe the ATL bug isn't ATL-specific at all, but somehow connected to > NETIF_F_HIGHDMA. Do you have 4GB+ of RAM? 
> > And one thing that suspend/resume does, which is not necessarily commonly > done during normal operation, is that ifconfig down/up pattern. Maybe > there is something broken in general there? Hm, that may be the case. In fact, I've cut the messages that precede the oops from the dmesg output, but they are from the b43 driver and the firewall (the full oops below is reproduced for completness): [12736.964336] b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10) [12737.692435] b43-phy0 debug: Chip initialized [12737.692659] b43-phy0 debug: 32-bit DMA initialized [12742.213601] Registered led device: b43-phy0::tx [12742.216372] Registered led device: b43-phy0::rx [12742.216559] Registered led device: b43-phy0::radio [12742.216587] b43-phy0 debug: Wireless interface started [12737.724614] b43-phy0 ERROR: PHY transmission error [12737.764440] b43-phy0 ERROR: PHY transmission error [12738.469683] b43-phy0 debug: Switching to 2.4-GHz band [12738.469755] b43-phy0 debug: Wireless interface stopped [12738.469958] b43-phy0 debug: DMA-32 rx_ring: Used slots 0/64, Failed frames 0/0 = 0.0%, Average tries 0.00 [12738.470020] b43-phy0 debug: DMA-32 tx_ring_AC_BK: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00 [12738.476448] b43-phy0 debug: DMA-32 tx_ring_AC_BE: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00 [12738.484436] b43-phy0 debug: DMA-32 tx_ring_AC_VI: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00 [12738.492433] b43-phy0 debug: DMA-32 tx_ring_AC_VO: Used slots 2/128, Failed frames 0/13 = 0.0%, Average tries 1.00 [12738.500433] b43-phy0 debug: DMA-32 tx_ring_mcast: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00 [12738.668447] b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10) [12739.892834] b43-phy0 debug: Chip initialized [12739.893099] b43-phy0 debug: 32-bit DMA initialized [12739.916479] Registered led device: b43-phy0::tx [12739.919263] Registered led device: b43-phy0::rx 
[12739.919329] Registered led device: b43-phy0::radio [12739.919372] b43-phy0 debug: Wireless interface started [12739.968824] wlan0: Initial auth_alg=0 [12739.968832] wlan0: authenticate with AP 00:17:9a:f3:b5:75 [12739.970261] wlan0: RX authentication from 00:17:9a:f3:b5:75 (alg=0 transaction=2 status=0) [12739.970266] wlan0: authenticated [12739.970269] wlan0: associate with AP 00:17:9a:f3:b5:75 [12739.972403] wlan0: RX AssocResp from 00:17:9a:f3:b5:75 (capab=0x431 status=0 aid=1) [12739.972408] wlan0: associated [12739.972420] wlan0: switched to short barker preamble (BSSID=00:17:9a:f3:b5:75) [12739.972954] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready [12750.001285] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [12750.125294] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=368 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=348 [12750.161238] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=254 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=234 [12750.381280] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=368 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=348 [12750.637329] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=368 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=348 [12757.297378] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=180 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=160 [12757.497389] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [12757.553399] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=180 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF 
PROTO=UDP SPT=5353 DPT=5353 LEN=160 [12757.809407] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=180 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=160 [12757.997557] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=378 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=358 [12766.069845] wlan0: no IPv6 routers present [12777.783641] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [12793.792438] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=01:00:5e:00:00:fb:00:13:8f:3a:0b:96:08:00 SRC=192.168.100.1 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [12817.529134] SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC= SRC=192.168.100.119 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44 [12844.066757] BUG: unable to handle kernel paging request at ffffffffffffffff [12844.066765] IP: [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.066775] PGD 203067 PUD 204067 PMD 0 [12844.066778] Oops: 0000 [1] SMP DEBUG_PAGEALLOC [12844.066782] CPU 1 [12844.066784] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet rfkill_input snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter cpufreq_conservative ip6_tables x_tables cpufreq_ondemand cpufreq_userspace ipv6 cpufreq_powersave powernow_k8 freq_table fuse dm_crypt loop dm_mod arc4 ecb crypto_blkcipher b43 rfkill mac80211 cfg80211 led_class rfcomm input_polldev l2cap fan ssb thermal pcmcia joydev snd_hda_intel snd_pcm rtc_cmos yenta_socket usbhid rtc_core hci_usb processor rsrc_nonstatic snd_timer shpchp psmouse i2c_piix4 sdhci 
ohci1394 battery pcmcia_core snd_page_alloc snd_hwdep tifm_7xx1 pci_hotplug serio_raw ide_cd_mod ac button i2c_core backlight output ieee1394 tifm_core mmc_core rtc_lib ff_memless bluetooth snd soundcore firmware_class k8temp cdrom tg3 sg ohci_hcd ehci_hcd usbcore edd ext3 jbd atiixp ide_core [12844.066854] Pid: 13078, comm: kio_file Tainted: G M 2.6.25 #401 [12844.066857] RIP: 0010:[<ffffffff802a7b3c>] [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.066861] RSP: 0018:ffff810064c5dc08 EFLAGS: 00010286 [12844.066863] RAX: ffffffffffffffff RBX: ffff8100f0bd7e10 RCX: 0000000000000012 [12844.066866] RDX: ffffffffffffffff RSI: ffff810064c5dd08 RDI: ffff810053304000 [12844.066868] RBP: ffff810064c5dc58 R08: 0000000000000003 R09: 0000000000000001 [12844.066871] R10: 0000000000000000 R11: 0000000000000246 R12: ffff810053304000 [12844.066873] R13: ffff810064c5dd08 R14: 000000005b3d8b1c R15: 000000000000001a [12844.066876] FS: 00007f08e0719700(0000) GS:ffff81007782d480(0000) knlGS:0000000000000000 [12844.066879] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [12844.066881] CR2: ffffffffffffffff CR3: 000000006a4f2000 CR4: 00000000000006a0 [12844.066884] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [12844.066886] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [12844.066889] Process kio_file (pid: 13078, threadinfo ffff810064c5c000, task ffff81005a8c8000) [12844.066891] Stack: ffff81000cdde000 000000000000001a ffff8100504a3000 000000000e310f76 [12844.066897] ffffffffffffffff ffff810068c941c0 ffff810064c5de38 ffff8100533050c8 [12844.066901] 0000000000000000 ffff810064c5de38 ffff810064c5dca8 ffffffff8029e236 [12844.066905] Call Trace: [12844.066919] [<ffffffff8029e236>] do_lookup+0x2c/0x1b2 [12844.066930] [<ffffffff802a04b4>] __link_path_walk+0x8e6/0xdbd [12844.066955] [<ffffffffa004deb4>] ? :ext3:ext3_xattr_get_acl_default+0x18/0x1a [12844.066961] [<ffffffff802b0869>] ? 
generic_getxattr+0x4e/0x5c [12844.066973] [<ffffffff802a09ec>] path_walk+0x61/0xc3 [12844.066981] [<ffffffff802a0cd2>] do_path_lookup+0x15d/0x1d9 [12844.066991] [<ffffffff802a161a>] __user_walk_fd+0x41/0x5c [12844.067000] [<ffffffff8029a252>] vfs_lstat_fd+0x24/0x5a [12844.067007] [<ffffffff8030b30d>] ? _atomic_dec_and_lock+0x3d/0x5c [12844.067013] [<ffffffff802abe02>] ? mntput_no_expire+0x20/0x8b [12844.067019] [<ffffffff8029dfe8>] ? path_put+0x2c/0x30 [12844.067021] [<ffffffff802b128d>] ? sys_getxattr+0x60/0x75 [12844.067021] [<ffffffff8029a2aa>] sys_newlstat+0x22/0x3c [12844.067021] [<ffffffff8020bf1b>] system_call_after_swapgs+0x7b/0x80 [12844.067021] [12844.067021] [12844.067021] Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b 45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b 00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff [12844.067021] RIP [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117 [12844.067021] RSP <ffff810064c5dc08> [12844.067021] CR2: ffffffffffffffff [12844.067021] ---[ end trace 02645136ff144df9 ]--- Thanks, Rafael ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 17:48 ` Linus Torvalds
  2008-04-21 18:22 ` Rafael J. Wysocki
@ 2008-04-21 19:38 ` Jiri Slaby
  1 sibling, 0 replies; 183+ messages in thread
From: Jiri Slaby @ 2008-04-21 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, LKML, Ingo Molnar, Andrew Morton, linux-ext4, Herbert Xu, Paul E. McKenney, David S. Miller

On 04/21/2008 07:48 PM, Linus Torvalds wrote:
> And one thing that suspend/resume does, which is not necessarily commonly
> done during normal operation, is that ifconfig down/up pattern. Maybe
> there is something broken in general there?

Who knows, unfortunately it seems so. I've found another two oopses related to
this in the logs (they are below). Again the dentry + offsetof(dentry, name)
address is broken here and it fires up in memcmp. I suspect somebody still uses
that bucket (assigned now to dentry) as if it hadn't ever been freed and
overwrites its members.

I also had a corrupted include/linux/irq.h file. There was
irq_has_<some_ugly_utf_char>ction or something like that. I don't remember the
exact function name, but compilation failed, and it hadn't failed when I
compiled the kernel for the first time -- I use that tree every day, so the
corruption must have happened that day. Anyway I have no idea if this is related.
BUG: unable to handle kernel paging request at ffff81f02003f16c IP: [<ffffffff802ad7d5>] __d_lookup+0x155/0x160 PGD 0 Oops: 0000 [1] SMP last sysfs file: /sys/devices/platform/coretemp.1/temp1_input CPU 1 Modules linked in: ppdev parport tun bitrev ipv6 test arc4 ecb crypto_blkcipher cryptomgr crypto_algapi ath5k mac80211 crc32 rtc_cmos sr_mod ohci1394 rtc_core usbhid rtc_lib ieee1394 cdrom cfg80211 hid usblp ehci_hcd ff_memless floppy [last unloaded: vmnet] Pid: 3710, comm: sensors-applet Tainted: P 2.6.25-rc8-mm2_64 #399 RIP: 0010:[<ffffffff802ad7d5>] [<ffffffff802ad7d5>] __d_lookup+0x155/0x160 RSP: 0018:ffff810057973b98 EFLAGS: 00010246 RAX: 0000000000000017 RBX: ffff81002003f0e0 RCX: 0000000000000017 RDX: 0000000000000017 RSI: ffff81f02003f16c RDI: ffff8100036f7022 RBP: ffff810057973bf8 R08: ffff810057973ca8 R09: 0000000000000000 R10: 00000000000000d8 R11: 0000000000000246 R12: ffff81002003f0c8 R13: 00000000910b9880 R14: ffff810035a5ded8 R15: ffff810057973bc8 FS: 00007f6e2b7266f0(0000) GS:ffff81007d006580(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff81f02003f16c CR3: 000000005788a000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sensors-applet (pid: 3710, threadinfo ffff810057972000, task ffff810062ace9e0) Stack: ffff810057973ca8 0000000000000017 ffff81002003f0d0 000000176767e000 ffff8100036f7022 ffffffff8047a695 ffff81002003f0e0 0000000000000001 ffff810057973e48 ffff810057973e48 ffff810057973ca8 ffff810057973cb8 Call Trace: [<ffffffff8047a695>] ? skb_release_data+0x85/0xd0 [<ffffffff802a2b95>] do_lookup+0x35/0x220 [<ffffffff802a2fd2>] __link_path_walk+0x252/0x1010 [<ffffffff8022b4d0>] ? 
default_wake_function+0x0/0x10 [<ffffffff802a3dfe>] path_walk+0x6e/0xe0 [<ffffffff802a40c2>] do_path_lookup+0xa2/0x240 [<ffffffff802a45c7>] __path_lookup_intent_open+0x67/0xd0 [<ffffffff802a463c>] path_lookup_open+0xc/0x10 [<ffffffff802a558a>] do_filp_open+0xaa/0x990 [<ffffffff80281778>] ? unmap_region+0x138/0x160 [<ffffffff80296aec>] ? get_unused_fd_flags+0x8c/0x140 [<ffffffff80296c16>] do_sys_open+0x76/0x110 [<ffffffff80296cdb>] sys_open+0x1b/0x20 [<ffffffff8020b88b>] system_call_after_swapgs+0x7b/0x80 Code: 89 e0 48 8b 55 b0 fe 02 eb ae 0f 1f 40 00 8b 45 bc 41 39 44 24 34 75 8d 48 8b 55 a8 49 8b 74 24 38 48 39 d2 48 8b 7d c0 48 89 d1 <f3> a6 0f 85 72 ff ff ff eb bb 90 55 48 89 e5 41 55 49 89 fd 41 RIP [<ffffffff802ad7d5>] __d_lookup+0x155/0x160 RSP <ffff810057973b98> CR2: ffff81f02003f16c ---[ end trace 9c63388ed58b7c09 ]--- BUG: unable to handle kernel paging request at fffff0002008493c IP: [<ffffffff802ad7d5>] __d_lookup+0x155/0x160 PGD 0 Oops: 0000 [1] SMP last sysfs file: /sys/devices/virtual/net/tun0/statistics/collisions CPU 0 Modules linked in: ipv6 tun bitrev test arc4 ecb crypto_blkcipher cryptomgr crypto_algapi ath5k mac80211 usbhid ohci1394 rtc_cmos crc32 sr_mod rtc_core ehci_hcd hid ieee1394 rtc_lib floppy cdrom cfg80211 ff_memless Pid: 12427, comm: find Not tainted 2.6.25-rc8-mm2_64 #399 RIP: 0010:[<ffffffff802ad7d5>] [<ffffffff802ad7d5>] __d_lookup+0x155/0x160 RSP: 0018:ffff81001a01bbf8 EFLAGS: 00010246 RAX: 0000000000000010 RBX: ffff8100200848b0 RCX: 0000000000000010 RDX: 0000000000000010 RSI: fffff0002008493c RDI: ffff81003dae9000 RBP: ffff81001a01bc58 R08: ffff81001a01bd08 R09: 0000000000000000 R10: 000000000000003f R11: 0000000000000246 R12: ffff810020084898 R13: 000000009047ba33 R14: ffff810020087d48 R15: ffff81001a01bc28 FS: 00007ff2f3a226f0(0000) GS:ffffffff80657000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: fffff0002008493c CR3: 000000001d512000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process find (pid: 12427, threadinfo ffff81001a01a000, task ffff81007d210790) Stack: ffff81001a01bd08 0000000000000010 ffff8100200848a0 0000001000000001 ffff81003dae9000 0000000000000082 ffff8100200848b0 0000000000000001 ffff81001a01be38 ffff81001a01be38 ffff81001a01bd08 ffff81001a01bd18 Call Trace: [<ffffffff802a2b95>] do_lookup+0x35/0x220 [<ffffffff802ae0a8>] ? dput+0x38/0x180 [<ffffffff802a2fd2>] __link_path_walk+0x252/0x1010 [<ffffffff802aec77>] ? file_update_time+0xc7/0x130 [<ffffffff802b2daa>] ? mntput_no_expire+0x2a/0x140 [<ffffffff802a3dfe>] path_walk+0x6e/0xe0 [<ffffffff802a40c2>] do_path_lookup+0xa2/0x240 [<ffffffff802a505c>] __user_walk_fd+0x4c/0x80 [<ffffffff8029c71b>] vfs_lstat_fd+0x2b/0x70 [<ffffffff8029c8f3>] ? cp_new_stat+0xe3/0xf0 [<ffffffff8029c95c>] sys_newfstatat+0x5c/0x80 [<ffffffff8020b88b>] system_call_after_swapgs+0x7b/0x80 Code: 89 e0 48 8b 55 b0 fe 02 eb ae 0f 1f 40 00 8b 45 bc 41 39 44 24 34 75 8d 48 8b 55 a8 49 8b 74 24 38 48 39 d2 48 8b 7d c0 48 89 d1 <f3> a6 0f 85 72 ff ff ff eb bb 90 55 48 89 e5 41 55 49 89 fd 41 RIP [<ffffffff802ad7d5>] __d_lookup+0x155/0x160 RSP <ffff81001a01bbf8> CR2: fffff0002008493c ---[ end trace 1e48f32334002427 ]--- ^ permalink raw reply [flat|nested] 183+ messages in thread
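A quick cross-check of the two __d_lookup+0x155 register dumps above, as a rough sketch only. It assumes (not confirmed anywhere in the thread) that R12 holds the dentry being compared and RBX its d_hash node, and it looks only at the values printed in the oopses. The names OOPSES and summarize are made up for this illustration.

```python
# Register values copied verbatim from the two oopses above.
OOPSES = {
    "sensors-applet": {
        "rbx": 0xFFFF81002003F0E0,
        "r12": 0xFFFF81002003F0C8,
        "rsi": 0xFFFF81F02003F16C,  # faulting address (CR2)
    },
    "find": {
        "rbx": 0xFFFF8100200848B0,
        "r12": 0xFFFF810020084898,
        "rsi": 0xFFFFF0002008493C,  # faulting address (CR2)
    },
}

LOW32 = 0xFFFFFFFF


def summarize(regs):
    """Compare the registers of one oops (hypothetical helper)."""
    return {
        # Distance between RBX and R12; assumed to be the d_hash offset.
        "rbx_minus_r12": regs["rbx"] - regs["r12"],
        # Distance of the bad pointer from R12, looking at the low 32 bits only.
        "rsi_low_minus_r12_low": (regs["rsi"] & LOW32) - (regs["r12"] & LOW32),
        # Upper 32 bits, where the two pointers disagree.
        "rsi_high": regs["rsi"] >> 32,
        "r12_high": regs["r12"] >> 32,
    }


if __name__ == "__main__":
    for name, regs in OOPSES.items():
        info = summarize(regs)
        print(name, {key: hex(value) for key, value in info.items()})
```

In both dumps RBX - R12 is 0x18 and the low halves of RSI and R12 differ by 0xa4, so the low 32 bits of the faulting pointer look intact and only the upper half differs from what a pointer into the same object would carry. That is an observation about these two dumps, not a diagnosis.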
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 16:54 ` Linus Torvalds 2008-04-21 17:06 ` Jiri Slaby @ 2008-04-21 20:39 ` David Miller 2008-04-21 21:18 ` Jiri Slaby 2008-04-21 21:19 ` Linus Torvalds 1 sibling, 2 replies; 183+ messages in thread From: David Miller @ 2008-04-21 20:39 UTC (permalink / raw) To: torvalds Cc: rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, paulmck, jirislaby From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 21 Apr 2008 09:54:07 -0700 (PDT) > What I find interesting is that at least for me, I have the SLAB bucket > size for nf_conntrack_expect being 208 bytes. And the *biggest* merge by > far after 2.6.25 so far has been networking (and conntrack in particular) > > Is that a smoking gun? Not necessarily. But it *is* intriguing. But there > are other possible clashes (the 192-byte bucket has several different > suspects, and not all of them are in networking).1 I think you might be onto something here. The "mask" member of struct nf_conntrack_expect could be reasonably all 1's like the value reported in the crash that begins this thread. Do we know the offset within the object at which this all 1's value is found? My rough calculations show that on 32-bit that expect->mask member is at offset 56 and on 64-bit it should be at offset 72. Does that match up to the offset of the filp or whatever bit being corrupted? I'll scan through the netfilter changesets in post 2.6.25 to see if anything stands out. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 20:39 ` David Miller @ 2008-04-21 21:18 ` Jiri Slaby 2008-04-21 21:58 ` Jiri Slaby 2008-04-21 21:19 ` Linus Torvalds 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-21 21:18 UTC (permalink / raw) To: David Miller Cc: torvalds, rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, paulmck, Zdenek Kabelac On 04/21/2008 10:39 PM, David Miller wrote: > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Mon, 21 Apr 2008 09:54:07 -0700 (PDT) > >> What I find interesting is that at least for me, I have the SLAB bucket >> size for nf_conntrack_expect being 208 bytes. And the *biggest* merge by >> far after 2.6.25 so far has been networking (and conntrack in particular) >> >> Is that a smoking gun? Not necessarily. But it *is* intriguing. But there >> are other possible clashes (the 192-byte bucket has several different >> suspects, and not all of them are in networking).1 > > I think you might be onto something here. > > The "mask" member of struct nf_conntrack_expect could be reasonably > all 1's like the value reported in the crash that begins this > thread. > > Do we know the offset within the object at which this all 1's > value is found? > > My rough calculations show that on 32-bit that expect->mask member is > at offset 56 and on 64-bit it should be at offset 72. Does that > match up to the offset of the filp or whatever bit being corrupted? dentry.d_name.name is 56 on 64-bit (my memcmp crashes) dentry.d_hash.next is 24 (crashed at least 3 times here, rafael's one) dentry.d_op is 136 (crash below) It's spreading :/. ---------- Forwarded message ---------- From: Zdenek Kabelac <zdenek.kabelac@gmail.com> Date: 21.4.2008 11:14 Subject: BUG: unable to handle kernel NULL pointer at d_free+0x18/0x80 To: Kernel development list <linux-kernel@vger.kernel.org> Hello This oops appeared in my log - unsure how it is related to my DVB-T tuner test before. 
But I've also seen another weird resume with some similar crash. Happens with 2.6.25 - commit 48a86f548fb74928f9a466f52527e83fecdb4575 (T61, 2GB) BUG: unable to handle kernel NULL pointer dereference at 0000000000000110 IP: [d_free+24/128] d_free+0x18/0x80 PGD 0 Oops: 0000 [1] PREEMPT SMP CPU 0 Modules linked in: usb_storage dvb_usb_af9015 dvb_usb_dibusb_common dib3000mc dibx000_common dvb_usb dvb_core tun nls_iso8859_2 nls_cp852 vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm snd_hda_intel arc4 ecb snd_seq_oss crypto_blkcipher snd_seq_midi_event snd_seq cryptomgr snd_seq_device snd_pcm_oss crypto_algapi iwl3945 mac80211 e1000e psmouse snd_mixer_oss rtc_cmos evdev rtc_core thinkpad_acpi video snd_pcm mmc_block sdhci mmc_core snd_timer iTCO_wdt iTCO_vendor_support battery backlight nvram rtc_lib i2c_i801 i2c_core ac snd soundcore snd_page_alloc intel_agp output serio_raw cfg80211 button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: dvb_core] Pid: 210, comm: kswapd0 Not tainted 2.6.25 #56 RIP: 0010:[d_free+24/128] [d_free+24/128] d_free+0x18/0x80 RSP: 0018:ffff81007ced9cf0 EFLAGS: 00010206 RAX: 00000000000000f0 RBX: ffff8100202723d8 RCX: 0000000000000132 RDX: 0000000000005e5d RSI: ffff81007ced4048 RDI: ffff8100202723d8 RBP: ffff81007ced9d00 R08: 0000000000000002 R09: d37a6f4de9bd37a7 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8100202723d8 R13: ffff81007c9329d8 R14: ffff8100202723e0 R15: 0000000000000029 FS: 0000000000000000(0000) GS:ffffffff81486000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000110 CR3: 0000000001001000 CR4: 00000000000026e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
Process kswapd0 (pid: 210, threadinfo ffff81007ced8000, task ffff81007ced4000) Stack: ffff8100202723d8 ffff81007b7e73d8 ffff81007ced9d20 ffffffff810cb3eb ffff8100202723d8 0000000000000000 ffff81007ced9d40 ffffffff810cb4d5 ffff8100202723d8 ffff8100202723d8 ffff81007ced9d80 ffffffff810cb642 Call Trace: [d_kill+59/96] d_kill+0x3b/0x60 [prune_one_dentry+197/240] prune_one_dentry+0xc5/0xf0 [prune_dcache+322/512] prune_dcache+0x142/0x200 [shrink_dcache_memory+65/80] shrink_dcache_memory+0x41/0x50 [shrink_slab+274/480] shrink_slab+0x112/0x1e0 [kswapd+1232/1552] kswapd+0x4d0/0x610 [isolate_pages_global+0/64] ? isolate_pages_global+0x0/0x40 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [_spin_unlock_irqrestore+69/144] ? _spin_unlock_irqrestore+0x45/0x90 [kswapd+0/1552] ? kswapd+0x0/0x610 [kthread+73/144] kthread+0x49/0x90 [child_rip+10/18] child_rip+0xa/0x12 [restore_args+0/48] ? restore_args+0x0/0x30 [kthread+0/144] ? kthread+0x0/0x90 [child_rip+0/18] ? child_rip+0x0/0x12 Code: 95 49 81 e8 ab ff 21 00 5b 41 5c c9 c3 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 87 b8 00 00 00 48 85 c0 74 0b <48> 8b 40 20 48 85 c0 74 02 ff d0 48 83 7b 50 00 74 1e 48 8d bb RIP [d_free+24/128] d_free+0x18/0x80 RSP <ffff81007ced9cf0> CR2: 0000000000000110 ---[ end trace ca143223eefdc828 ]--- ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 21:18 ` Jiri Slaby @ 2008-04-21 21:58 ` Jiri Slaby 2008-04-21 22:26 ` Jiri Slaby 0 siblings, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-21 21:58 UTC (permalink / raw) To: David Miller Cc: torvalds, rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, paulmck, Zdenek Kabelac Leaving untouched. On 04/21/2008 11:18 PM, Jiri Slaby wrote: > On 04/21/2008 10:39 PM, David Miller wrote: >> From: Linus Torvalds <torvalds@linux-foundation.org> >> Date: Mon, 21 Apr 2008 09:54:07 -0700 (PDT) >> >>> What I find interesting is that at least for me, I have the SLAB >>> bucket size for nf_conntrack_expect being 208 bytes. And the >>> *biggest* merge by far after 2.6.25 so far has been networking (and >>> conntrack in particular) >>> >>> Is that a smoking gun? Not necessarily. But it *is* intriguing. But >>> there are other possible clashes (the 192-byte bucket has several >>> different suspects, and not all of them are in networking).1 >> >> I think you might be onto something here. >> >> The "mask" member of struct nf_conntrack_expect could be reasonably >> all 1's like the value reported in the crash that begins this >> thread. >> >> Do we know the offset within the object at which this all 1's >> value is found? >> >> My rough calculations show that on 32-bit that expect->mask member is >> at offset 56 and on 64-bit it should be at offset 72. Does that >> match up to the offset of the filp or whatever bit being corrupted? > > dentry.d_name.name is 56 on 64-bit (my memcmp crashes) > dentry.d_hash.next is 24 (crashed at least 3 times here, rafael's one) > dentry.d_op is 136 (crash below) file.f_mapping is 176 (the another one from -rc8-mm2) the one at: http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/9008289.html Having slub_debug enabled, tomorrow will be results, I guess... > It's spreading :/. 
> > ---------- Forwarded message ---------- > From: Zdenek Kabelac <zdenek.kabelac@gmail.com> > Date: 21.4.2008 11:14 > Subject: BUG: unable to handle kernel NULL pointer at d_free+0x18/0x80 > To: Kernel development list <linux-kernel@vger.kernel.org> > > > Hello > > This oops appeared in my log - unsure how it is related to my DVB-T > tuner test before. > But I've also seen another weird resume with some similar crash. > > Happens with 2.6.25 - commit 48a86f548fb74928f9a466f52527e83fecdb4575 > (T61, 2GB) > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000110 > IP: [d_free+24/128] d_free+0x18/0x80 > PGD 0 > Oops: 0000 [1] PREEMPT SMP > CPU 0 > Modules linked in: usb_storage dvb_usb_af9015 dvb_usb_dibusb_common > dib3000mc dibx000_common dvb_usb dvb_core tun nls_iso8859_2 nls_cp852 > vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 > xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables > x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 > sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput > kvm_intel kvm snd_hda_intel arc4 ecb snd_seq_oss crypto_blkcipher > snd_seq_midi_event snd_seq cryptomgr snd_seq_device snd_pcm_oss > crypto_algapi iwl3945 mac80211 e1000e psmouse snd_mixer_oss rtc_cmos > evdev rtc_core thinkpad_acpi video snd_pcm mmc_block sdhci mmc_core > snd_timer iTCO_wdt iTCO_vendor_support battery backlight nvram rtc_lib > i2c_i801 i2c_core ac snd soundcore snd_page_alloc intel_agp output > serio_raw cfg80211 button uhci_hcd ohci_hcd ehci_hcd usbcore [last > unloaded: dvb_core] > Pid: 210, comm: kswapd0 Not tainted 2.6.25 #56 > RIP: 0010:[d_free+24/128] [d_free+24/128] d_free+0x18/0x80 > RSP: 0018:ffff81007ced9cf0 EFLAGS: 00010206 > RAX: 00000000000000f0 RBX: ffff8100202723d8 RCX: 0000000000000132 > RDX: 0000000000005e5d RSI: ffff81007ced4048 RDI: ffff8100202723d8 > RBP: ffff81007ced9d00 R08: 0000000000000002 R09: d37a6f4de9bd37a7 > R10: 0000000000000000 R11: 
0000000000000000 R12: ffff8100202723d8 > R13: ffff81007c9329d8 R14: ffff8100202723e0 R15: 0000000000000029 > FS: 0000000000000000(0000) GS:ffffffff81486000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000000000000110 CR3: 0000000001001000 CR4: 00000000000026e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process kswapd0 (pid: 210, threadinfo ffff81007ced8000, task > ffff81007ced4000) > Stack: ffff8100202723d8 ffff81007b7e73d8 ffff81007ced9d20 > ffffffff810cb3eb > ffff8100202723d8 0000000000000000 ffff81007ced9d40 ffffffff810cb4d5 > ffff8100202723d8 ffff8100202723d8 ffff81007ced9d80 ffffffff810cb642 > Call Trace: > [d_kill+59/96] d_kill+0x3b/0x60 > [prune_one_dentry+197/240] prune_one_dentry+0xc5/0xf0 > [prune_dcache+322/512] prune_dcache+0x142/0x200 > [shrink_dcache_memory+65/80] shrink_dcache_memory+0x41/0x50 > [shrink_slab+274/480] shrink_slab+0x112/0x1e0 > [kswapd+1232/1552] kswapd+0x4d0/0x610 > [isolate_pages_global+0/64] ? isolate_pages_global+0x0/0x40 > [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > [_spin_unlock_irqrestore+69/144] ? _spin_unlock_irqrestore+0x45/0x90 > [kswapd+0/1552] ? kswapd+0x0/0x610 > [kthread+73/144] kthread+0x49/0x90 > [child_rip+10/18] child_rip+0xa/0x12 > [restore_args+0/48] ? restore_args+0x0/0x30 > [kthread+0/144] ? kthread+0x0/0x90 > [child_rip+0/18] ? child_rip+0x0/0x12 > > > Code: 95 49 81 e8 ab ff 21 00 5b 41 5c c9 c3 66 0f 1f 44 00 00 55 48 > 89 e5 53 48 89 fb 48 83 ec 08 48 8b 87 b8 00 00 00 48 85 c0 74 0b <48> > 8b 40 20 48 85 c0 74 02 ff d0 48 83 7b 50 00 74 1e 48 8d bb > RIP [d_free+24/128] d_free+0x18/0x80 > RSP <ffff81007ced9cf0> > CR2: 0000000000000110 > ---[ end trace ca143223eefdc828 ]--- ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 21:58 ` Jiri Slaby @ 2008-04-21 22:26 ` Jiri Slaby 2008-04-21 22:54 ` Paul E. McKenney 0 siblings, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-21 22:26 UTC (permalink / raw) To: David Miller Cc: torvalds, rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, paulmck, Zdenek Kabelac On 04/21/2008 11:58 PM, Jiri Slaby wrote: > Leaving untouched. > > On 04/21/2008 11:18 PM, Jiri Slaby wrote: >> On 04/21/2008 10:39 PM, David Miller wrote: >>> From: Linus Torvalds <torvalds@linux-foundation.org> >>> Date: Mon, 21 Apr 2008 09:54:07 -0700 (PDT) >>> >>>> What I find interesting is that at least for me, I have the SLAB >>>> bucket size for nf_conntrack_expect being 208 bytes. And the >>>> *biggest* merge by far after 2.6.25 so far has been networking (and >>>> conntrack in particular) >>>> >>>> Is that a smoking gun? Not necessarily. But it *is* intriguing. But >>>> there are other possible clashes (the 192-byte bucket has several >>>> different suspects, and not all of them are in networking).1 >>> >>> I think you might be onto something here. >>> >>> The "mask" member of struct nf_conntrack_expect could be reasonably >>> all 1's like the value reported in the crash that begins this >>> thread. >>> >>> Do we know the offset within the object at which this all 1's >>> value is found? >>> >>> My rough calculations show that on 32-bit that expect->mask member is >>> at offset 56 and on 64-bit it should be at offset 72. Does that >>> match up to the offset of the filp or whatever bit being corrupted? 
>> >> dentry.d_name.name is 56 on 64-bit (my memcmp crashes) >> dentry.d_hash.next is 24 (crashed at least 3 times here, rafael's one) >> dentry.d_op is 136 (crash below) > > file.f_mapping is 176 (the another one from -rc8-mm2) > > the one at: > http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/9008289.html > > > Having slub_debug enabled, tomorrow will be results, I guess... Sorry, one more entry: 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) 00f0000000000000 dentry.d_hash.next (me, offset 24) ffff81f02003f16c dentry.d_name.name (me, offset 56) memory ORed by 000000f000000000 fffff0002004c1b0 file.f_mapping (me, offset 176) memory hole, it was something like (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) -1, ~0ULL What are these nibble plays? >> It's spreading :/. >> >> ---------- Forwarded message ---------- >> From: Zdenek Kabelac <zdenek.kabelac@gmail.com> >> Date: 21.4.2008 11:14 >> Subject: BUG: unable to handle kernel NULL pointer at d_free+0x18/0x80 >> To: Kernel development list <linux-kernel@vger.kernel.org> >> >> >> Hello >> >> This oops appeared in my log - unsure how it is related to my DVB-T >> tuner test before. >> But I've also seen another weird resume with some similar crash. 
>> >> Happens with 2.6.25 - commit 48a86f548fb74928f9a466f52527e83fecdb4575 >> (T61, 2GB) >> >> >> BUG: unable to handle kernel NULL pointer dereference at >> 0000000000000110 >> IP: [d_free+24/128] d_free+0x18/0x80 >> PGD 0 >> Oops: 0000 [1] PREEMPT SMP >> CPU 0 >> Modules linked in: usb_storage dvb_usb_af9015 dvb_usb_dibusb_common >> dib3000mc dibx000_common dvb_usb dvb_core tun nls_iso8859_2 nls_cp852 >> vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 >> xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables >> x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 >> sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput >> kvm_intel kvm snd_hda_intel arc4 ecb snd_seq_oss crypto_blkcipher >> snd_seq_midi_event snd_seq cryptomgr snd_seq_device snd_pcm_oss >> crypto_algapi iwl3945 mac80211 e1000e psmouse snd_mixer_oss rtc_cmos >> evdev rtc_core thinkpad_acpi video snd_pcm mmc_block sdhci mmc_core >> snd_timer iTCO_wdt iTCO_vendor_support battery backlight nvram rtc_lib >> i2c_i801 i2c_core ac snd soundcore snd_page_alloc intel_agp output >> serio_raw cfg80211 button uhci_hcd ohci_hcd ehci_hcd usbcore [last >> unloaded: dvb_core] >> Pid: 210, comm: kswapd0 Not tainted 2.6.25 #56 >> RIP: 0010:[d_free+24/128] [d_free+24/128] d_free+0x18/0x80 >> RSP: 0018:ffff81007ced9cf0 EFLAGS: 00010206 >> RAX: 00000000000000f0 RBX: ffff8100202723d8 RCX: 0000000000000132 >> RDX: 0000000000005e5d RSI: ffff81007ced4048 RDI: ffff8100202723d8 >> RBP: ffff81007ced9d00 R08: 0000000000000002 R09: d37a6f4de9bd37a7 >> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8100202723d8 >> R13: ffff81007c9329d8 R14: ffff8100202723e0 R15: 0000000000000029 >> FS: 0000000000000000(0000) GS:ffffffff81486000(0000) >> knlGS:0000000000000000 >> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >> CR2: 0000000000000110 CR3: 0000000001001000 CR4: 00000000000026e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> Process kswapd0 (pid: 210, threadinfo ffff81007ced8000, task >> ffff81007ced4000) >> Stack: ffff8100202723d8 ffff81007b7e73d8 ffff81007ced9d20 >> ffffffff810cb3eb >> ffff8100202723d8 0000000000000000 ffff81007ced9d40 ffffffff810cb4d5 >> ffff8100202723d8 ffff8100202723d8 ffff81007ced9d80 ffffffff810cb642 >> Call Trace: >> [d_kill+59/96] d_kill+0x3b/0x60 >> [prune_one_dentry+197/240] prune_one_dentry+0xc5/0xf0 >> [prune_dcache+322/512] prune_dcache+0x142/0x200 >> [shrink_dcache_memory+65/80] shrink_dcache_memory+0x41/0x50 >> [shrink_slab+274/480] shrink_slab+0x112/0x1e0 >> [kswapd+1232/1552] kswapd+0x4d0/0x610 >> [isolate_pages_global+0/64] ? isolate_pages_global+0x0/0x40 >> [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 >> [_spin_unlock_irqrestore+69/144] ? _spin_unlock_irqrestore+0x45/0x90 >> [kswapd+0/1552] ? kswapd+0x0/0x610 >> [kthread+73/144] kthread+0x49/0x90 >> [child_rip+10/18] child_rip+0xa/0x12 >> [restore_args+0/48] ? restore_args+0x0/0x30 >> [kthread+0/144] ? kthread+0x0/0x90 >> [child_rip+0/18] ? child_rip+0x0/0x12 >> >> >> Code: 95 49 81 e8 ab ff 21 00 5b 41 5c c9 c3 66 0f 1f 44 00 00 55 48 >> 89 e5 53 48 89 fb 48 83 ec 08 48 8b 87 b8 00 00 00 48 85 c0 74 0b <48> >> 8b 40 20 48 85 c0 74 02 ff d0 48 83 7b 50 00 74 1e 48 8d bb >> RIP [d_free+24/128] d_free+0x18/0x80 >> RSP <ffff81007ced9cf0> >> CR2: 0000000000000110 >> ---[ end trace ca143223eefdc828 ]--- ^ permalink raw reply [flat|nested] 183+ messages in thread
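The nibble patterns listed in the message above can be checked mechanically. This is a minimal sketch, not a diagnosis: the "assumed original" for the d_name.name pointer (ffff81002003f16c) is a guess made here by analogy with the other ffff8100... pointers in that oops, and the helper names changed_bits, jiri_guess and f_nibbles are invented for this illustration.

```python
MASK64 = (1 << 64) - 1


def changed_bits(observed, assumed_original):
    """Bits that differ between the observed value and a guessed original."""
    return (observed ^ assumed_original) & MASK64


def jiri_guess(original):
    """The transform guessed above for f_mapping:
    (x & ~00000f0000000000) | 0000f00000000000."""
    return (original & ~0x00000F0000000000 & MASK64) | 0x0000F00000000000


def f_nibbles(value):
    """Indices (0 = least significant) of the nibbles that are exactly 0xf."""
    return [i for i in range(16) if (value >> (4 * i)) & 0xF == 0xF]


if __name__ == "__main__":
    # d_name.name: observed value versus the guessed original.
    diff = changed_bits(0xFFFF81F02003F16C, 0xFFFF81002003F16C)
    print(f"d_name.name diff:  {diff:016x}")

    # f_mapping: apply the guessed transform to the guessed original.
    print(f"f_mapping guess:   {jiri_guess(0xFFFF81002004C1B0):016x}")

    # Where the stray 0xf nibble sits in each corrupted or diff value.
    for label, value in [
        ("d_op", 0x00000000000000F0),
        ("d_hash.next", 0x00F0000000000000),
        ("d_name.name diff", diff),
    ]:
        print(f"{label:18s} f-nibble index: {f_nibbles(value)}")
```

Under those assumptions the d_name.name difference comes out as 000000f000000000, matching the "ORed by" note, and the guessed f_mapping transform reproduces fffff0002004c1b0. The stray 0xf nibble lands at a different nibble position in each value (1, 13 and 9).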
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 22:26 ` Jiri Slaby @ 2008-04-21 22:54 ` Paul E. McKenney 2008-04-21 23:02 ` Jiri Slaby 2008-04-22 1:15 ` Rafael J. Wysocki 0 siblings, 2 replies; 183+ messages in thread From: Paul E. McKenney @ 2008-04-21 22:54 UTC (permalink / raw) To: Jiri Slaby Cc: David Miller, torvalds, rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On Tue, Apr 22, 2008 at 12:26:04AM +0200, Jiri Slaby wrote: > On 04/21/2008 11:58 PM, Jiri Slaby wrote: > >Leaving untouched. > > > >On 04/21/2008 11:18 PM, Jiri Slaby wrote: > >>On 04/21/2008 10:39 PM, David Miller wrote: > >>>From: Linus Torvalds <torvalds@linux-foundation.org> > >>>Date: Mon, 21 Apr 2008 09:54:07 -0700 (PDT) > >>> > >>>>What I find interesting is that at least for me, I have the SLAB > >>>>bucket size for nf_conntrack_expect being 208 bytes. And the > >>>>*biggest* merge by far after 2.6.25 so far has been networking (and > >>>>conntrack in particular) > >>>> > >>>>Is that a smoking gun? Not necessarily. But it *is* intriguing. But > >>>>there are other possible clashes (the 192-byte bucket has several > >>>>different suspects, and not all of them are in networking).1 > >>> > >>>I think you might be onto something here. > >>> > >>>The "mask" member of struct nf_conntrack_expect could be reasonably > >>>all 1's like the value reported in the crash that begins this > >>>thread. > >>> > >>>Do we know the offset within the object at which this all 1's > >>>value is found? > >>> > >>>My rough calculations show that on 32-bit that expect->mask member is > >>>at offset 56 and on 64-bit it should be at offset 72. Does that > >>>match up to the offset of the filp or whatever bit being corrupted? 
> >> > >>dentry.d_name.name is 56 on 64-bit (my memcmp crashes) > >>dentry.d_hash.next is 24 (crashed at least 3 times here, rafael's one) > >>dentry.d_op is 136 (crash below) > > > >file.f_mapping is 176 (the another one from -rc8-mm2) > > > >the one at: > >http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/9008289.html > > > > > >Having slub_debug enabled, tomorrow will be results, I guess... > > Sorry, one more entry: > > 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) > 00f0000000000000 dentry.d_hash.next (me, offset 24) > ffff81f02003f16c dentry.d_name.name (me, offset 56) > memory ORed by 000000f000000000 > fffff0002004c1b0 file.f_mapping (me, offset 176) > memory hole, it was something like > (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? > ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) > -1, ~0ULL Are these running with CONFIG_PREEMPT_RCU? Grasping at straws, but there are a couple of patches that need to move from -rt to mainline, but mostly related to SELinux. So if both PREEMPT_RCU and SELinux were in use, we might be missing "rcu-various-fixups.patch" from: http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.24.4-rt4-broken-out.tar.bz2 Thanx, Paul > What are these nibble plays? > > >>It's spreading :/. > >> > >>---------- Forwarded message ---------- > >>From: Zdenek Kabelac <zdenek.kabelac@gmail.com> > >>Date: 21.4.2008 11:14 > >>Subject: BUG: unable to handle kernel NULL pointer at d_free+0x18/0x80 > >>To: Kernel development list <linux-kernel@vger.kernel.org> > >> > >> > >>Hello > >> > >> This oops appeared in my log - unsure how it is related to my DVB-T > >> tuner test before. > >> But I've also seen another weird resume with some similar crash. 
> >> > >> Happens with 2.6.25 - commit 48a86f548fb74928f9a466f52527e83fecdb4575 > >> (T61, 2GB) > >> > >> > >> BUG: unable to handle kernel NULL pointer dereference at > >>0000000000000110 > >> IP: [d_free+24/128] d_free+0x18/0x80 > >> PGD 0 > >> Oops: 0000 [1] PREEMPT SMP > >> CPU 0 > >> Modules linked in: usb_storage dvb_usb_af9015 dvb_usb_dibusb_common > >> dib3000mc dibx000_common dvb_usb dvb_core tun nls_iso8859_2 nls_cp852 > >> vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 > >> xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables > >> x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 > >> sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput > >> kvm_intel kvm snd_hda_intel arc4 ecb snd_seq_oss crypto_blkcipher > >> snd_seq_midi_event snd_seq cryptomgr snd_seq_device snd_pcm_oss > >> crypto_algapi iwl3945 mac80211 e1000e psmouse snd_mixer_oss rtc_cmos > >> evdev rtc_core thinkpad_acpi video snd_pcm mmc_block sdhci mmc_core > >> snd_timer iTCO_wdt iTCO_vendor_support battery backlight nvram rtc_lib > >> i2c_i801 i2c_core ac snd soundcore snd_page_alloc intel_agp output > >> serio_raw cfg80211 button uhci_hcd ohci_hcd ehci_hcd usbcore [last > >> unloaded: dvb_core] > >> Pid: 210, comm: kswapd0 Not tainted 2.6.25 #56 > >> RIP: 0010:[d_free+24/128] [d_free+24/128] d_free+0x18/0x80 > >> RSP: 0018:ffff81007ced9cf0 EFLAGS: 00010206 > >> RAX: 00000000000000f0 RBX: ffff8100202723d8 RCX: 0000000000000132 > >> RDX: 0000000000005e5d RSI: ffff81007ced4048 RDI: ffff8100202723d8 > >> RBP: ffff81007ced9d00 R08: 0000000000000002 R09: d37a6f4de9bd37a7 > >> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8100202723d8 > >> R13: ffff81007c9329d8 R14: ffff8100202723e0 R15: 0000000000000029 > >> FS: 0000000000000000(0000) GS:ffffffff81486000(0000) > >>knlGS:0000000000000000 > >> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > >> CR2: 0000000000000110 CR3: 0000000001001000 CR4: 00000000000026e0 > >> DR0: 
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > >> Process kswapd0 (pid: 210, threadinfo ffff81007ced8000, task > >>ffff81007ced4000) > >> Stack: ffff8100202723d8 ffff81007b7e73d8 ffff81007ced9d20 > >>ffffffff810cb3eb > >> ffff8100202723d8 0000000000000000 ffff81007ced9d40 ffffffff810cb4d5 > >> ffff8100202723d8 ffff8100202723d8 ffff81007ced9d80 ffffffff810cb642 > >> Call Trace: > >> [d_kill+59/96] d_kill+0x3b/0x60 > >> [prune_one_dentry+197/240] prune_one_dentry+0xc5/0xf0 > >> [prune_dcache+322/512] prune_dcache+0x142/0x200 > >> [shrink_dcache_memory+65/80] shrink_dcache_memory+0x41/0x50 > >> [shrink_slab+274/480] shrink_slab+0x112/0x1e0 > >> [kswapd+1232/1552] kswapd+0x4d0/0x610 > >> [isolate_pages_global+0/64] ? isolate_pages_global+0x0/0x40 > >> [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > >> [_spin_unlock_irqrestore+69/144] ? _spin_unlock_irqrestore+0x45/0x90 > >> [kswapd+0/1552] ? kswapd+0x0/0x610 > >> [kthread+73/144] kthread+0x49/0x90 > >> [child_rip+10/18] child_rip+0xa/0x12 > >> [restore_args+0/48] ? restore_args+0x0/0x30 > >> [kthread+0/144] ? kthread+0x0/0x90 > >> [child_rip+0/18] ? child_rip+0x0/0x12 > >> > >> > >> Code: 95 49 81 e8 ab ff 21 00 5b 41 5c c9 c3 66 0f 1f 44 00 00 55 48 > >> 89 e5 53 48 89 fb 48 83 ec 08 48 8b 87 b8 00 00 00 48 85 c0 74 0b <48> > >> 8b 40 20 48 85 c0 74 02 ff d0 48 83 7b 50 00 74 1e 48 8d bb > >> RIP [d_free+24/128] d_free+0x18/0x80 > >> RSP <ffff81007ced9cf0> > >> CR2: 0000000000000110 > >> ---[ end trace ca143223eefdc828 ]--- ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 22:54 ` Paul E. McKenney @ 2008-04-21 23:02 ` Jiri Slaby 2008-04-21 23:11 ` Zdenek Kabelac 2008-04-21 23:17 ` Jiri Slaby 2008-04-22 1:15 ` Rafael J. Wysocki 1 sibling, 2 replies; 183+ messages in thread From: Jiri Slaby @ 2008-04-21 23:02 UTC (permalink / raw) To: paulmck Cc: David Miller, torvalds, rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On 04/22/2008 12:54 AM, Paul E. McKenney wrote: > On Tue, Apr 22, 2008 at 12:26:04AM +0200, Jiri Slaby wrote: >>> Having slub_debug enabled, tomorrow will be results, I guess... >> Sorry, one more entry: >> >> 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) Zdenek's is at offset 184. >> 00f0000000000000 dentry.d_hash.next (me, offset 24) >> ffff81f02003f16c dentry.d_name.name (me, offset 56) >> memory ORed by 000000f000000000 >> fffff0002004c1b0 file.f_mapping (me, offset 176) >> memory hole, it was something like >> (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? >> ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) >> -1, ~0ULL > > Are these running with CONFIG_PREEMPT_RCU? Grasping at straws, but > there are a couple of patches that need to move from -rt to mainline, > but mostly related to SELinux. So if both PREEMPT_RCU and SELinux > were in use, we might be missing "rcu-various-fixups.patch" from: $ grep RCU .config CONFIG_CLASSIC_RCU=y # CONFIG_RCU_TORTURE_TEST is not set $ grep SECU .config # CONFIG_EXT4DEV_FS_SECURITY is not set # CONFIG_SECURITY is not set # CONFIG_SECURITY_FILE_CAPABILITIES is not set I guess not. BTW the corruption I mentioned earlier was char 'ð' and it's ('p' | 0xf0) in latin2. I think it was set_ðending_irq IIRC. Whatever, it won't help us. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 23:02 ` Jiri Slaby @ 2008-04-21 23:11 ` Zdenek Kabelac 2008-04-21 23:17 ` Jiri Slaby 1 sibling, 0 replies; 183+ messages in thread From: Zdenek Kabelac @ 2008-04-21 23:11 UTC (permalink / raw) To: Jiri Slaby Cc: paulmck, David Miller, torvalds, rjw, linux-kernel, mingo, akpm, linux-ext4, herbert 2008/4/22, Jiri Slaby <jirislaby@gmail.com>: > On 04/22/2008 12:54 AM, Paul E. McKenney wrote: > > > On Tue, Apr 22, 2008 at 12:26:04AM +0200, Jiri Slaby wrote: > > > > > > > > > Having slub_debug enabled, tomorrow will be results, I guess... > > > > > > > Sorry, one more entry: > > > > > > 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) > > > > > > > Zdenek's is at offset 184. > > > > > > > 00f0000000000000 dentry.d_hash.next (me, offset 24) > > > ffff81f02003f16c dentry.d_name.name (me, offset 56) > > > memory ORed by 000000f000000000 > > > fffff0002004c1b0 file.f_mapping (me, offset 176) > > > memory hole, it was something like > > > (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? > > > ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) > > > -1, ~0ULL > > > > > > > Are these running with CONFIG_PREEMPT_RCU? Grasping at straws, but > > there are a couple of patches that need to move from -rt to mainline, > > but mostly related to SELinux. So if both PREEMPT_RCU and SELinux > > were in use, we might be missing "rcu-various-fixups.patch" from: > > > > $ grep RCU .config > CONFIG_CLASSIC_RCU=y > # CONFIG_RCU_TORTURE_TEST is not set > $ grep SECU .config > # CONFIG_EXT4DEV_FS_SECURITY is not set > # CONFIG_SECURITY is not set > # CONFIG_SECURITY_FILE_CAPABILITIES is not set > > I guess not. > > BTW the corruption I mentioned earlier was char 'ð' and it's ('p' | 0xf0) > in latin2. I think it was set_ðending_irq IIRC. Whatever, it won't help us. 
> I have my kernel compiled with preemptible RCU and Security enabled, but I usually boot with selinux=off as a kernel parameter. Zdenek ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 23:02 ` Jiri Slaby 2008-04-21 23:11 ` Zdenek Kabelac @ 2008-04-21 23:17 ` Jiri Slaby 2008-04-22 0:54 ` Rafael J. Wysocki 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-21 23:17 UTC (permalink / raw) To: paulmck Cc: David Miller, torvalds, rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On 04/22/2008 01:02 AM, Jiri Slaby wrote: > On 04/22/2008 12:54 AM, Paul E. McKenney wrote: >> On Tue, Apr 22, 2008 at 12:26:04AM +0200, Jiri Slaby wrote: >>>> Having slub_debug enabled, tomorrow will be results, I guess... OK, methinks it's tomorrow yet, at least here. >>> Sorry, one more entry: >>> >>> 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) > > Zdenek's is at offset 184. > >>> 00f0000000000000 dentry.d_hash.next (me, offset 24) >>> ffff81f02003f16c dentry.d_name.name (me, offset 56) >>> memory ORed by 000000f000000000 >>> fffff0002004c1b0 file.f_mapping (me, offset 176) >>> memory hole, it was something like >>> (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? >>> ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) >>> -1, ~0ULL The same place, dentry.d_hash.next is 1. No slub debug clues... I think, I'll give slab a try. Any other clues? Is this enough? 
$ grep SLUB ../my_64/.config CONFIG_SLUB_DEBUG=y CONFIG_SLUB=y # CONFIG_SLUB_DEBUG_ON is not set # CONFIG_SLUB_STATS is not set $ cat /proc/cmdline root=/dev/md1 vga=1 ro reboot=a,w slub_debug BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 IP: [<ffffffff802aca27>] __d_lookup+0x97/0x160 PGD 4510b067 PUD 6768d067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /sys/devices/virtual/net/tun0/statistics/collisions CPU 0 Modules linked in: test ipv6 tun bitrev arc4 ecb crypto_blkcipher cryptomgr crypto_algapi ath5k mac80211 rtc_cmos crc32 sr_mod usbhid ohci1394 ehci_hcd rtc_core hid ieee1394 floppy cdrom cfg80211 rtc_lib evdev ff_memless Pid: 18600, comm: git-status Not tainted 2.6.25-mm1_64 #403 RIP: 0010:[<ffffffff802aca27>] [<ffffffff802aca27>] __d_lookup+0x97/0x160 RSP: 0018:ffff81006096bbf8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000012 RDX: ffff8100200f3568 RSI: ffff81006096bd08 RDI: ffff810020c0c880 RBP: ffff81006096bc58 R08: ffff81006096bd08 R09: 000000000000002c R10: 000000000000002d R11: ffff81006428c200 R12: ffff810021f0a770 R13: 000000001b820c0e R14: ffff810020c0c880 R15: ffff81006096bc28 FS: 00007f2aa905e710(0000) GS:ffffffff80664000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000001 CR3: 0000000008fba000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process git-status (pid: 18600, threadinfo ffff81006096a000, task ffff810007988fc0) Stack: ffff81006096bd08 0000000000000009 ffff810020c0c888 000000098026c2fd ffff81006428c21c 0000000000000000 0000000000000001 0000000000000001 ffff81006096be38 ffff81006096be38 ffff81006096bd08 ffff81006096bd18 Call Trace: [<ffffffff802a1e85>] do_lookup+0x35/0x220 [<ffffffff802ad3b8>] ? dput+0x38/0x180 [<ffffffff802a22c2>] __link_path_walk+0x252/0x1010 [<ffffffff802911d0>] ? 
init_object+0x50/0x90 [<ffffffff802a30ee>] path_walk+0x6e/0xe0 [<ffffffff802a33b2>] do_path_lookup+0xa2/0x240 [<ffffffff802a434c>] __user_walk_fd+0x4c/0x80 [<ffffffff8029ba0b>] vfs_lstat_fd+0x2b/0x70 [<ffffffff8029bbe3>] ? cp_new_stat+0xe3/0xf0 [<ffffffff8029bc97>] sys_newlstat+0x27/0x50 [<ffffffff8020b91b>] system_call_after_swapgs+0x7b/0x80 Code: 48 89 c3 48 8b 55 d0 8b 45 bc 48 85 d2 48 89 45 a8 75 18 eb 5f 0f 1f 80 00 00 00 00 48 8b 1b 48 89 5d d0 49 8b 07 48 85 c0 74 49 <48> 8b 03 4c 8d 63 e8 0f 18 08 45 39 6c 24 30 75 e0 4d 39 74 24 RIP [<ffffffff802aca27>] __d_lookup+0x97/0x160 RSP <ffff81006096bbf8> CR2: 0000000000000001 ---[ end trace f6b7fa8dcbc7b8f7 ]--- ^ permalink raw reply [flat|nested] 183+ messages in thread
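For readers following the trace above: the marked instruction bytes `<48> 8b 03` appear to decode to a load through RBX (which holds 1, matching CR2), and the next instruction, `4c 8d 63 e8`, steps back 0x18 = 24 bytes, which fits the d_hash offset of 24 quoted in this thread. The following is only a userspace sketch of that mechanism, not kernel code. `struct hnode`, `struct mock_dentry`, `dentry_of`, `walk_chain` and `demo_bad_next` are hypothetical names, and the alignment test is an illustrative assumption: it would flag a next pointer of 1 or ~0ULL, but not an aligned bogus value such as 00f0000000000000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's hlist_node; only the layout matters here. */
struct hnode {
	struct hnode *next;
	struct hnode **pprev;
};

/* Hypothetical mock: padding chosen so that d_hash sits at offset 24. */
struct mock_dentry {
	unsigned char pad[24];
	struct hnode d_hash;
};

/* Same arithmetic as container_of(): step back from the embedded node. */
struct mock_dentry *dentry_of(struct hnode *node)
{
	assert(node != NULL);
	return (struct mock_dentry *)((unsigned char *)node -
				      offsetof(struct mock_dentry, d_hash));
}

/*
 * Walk a chain and count its nodes, but stop at the first pointer that is
 * not pointer-aligned instead of dereferencing it. The offending value is
 * stored in *bad (0 if the chain ended normally).
 */
int walk_chain(struct hnode *first, uintptr_t *bad)
{
	struct hnode *node = first;
	int count = 0;

	assert(bad != NULL);
	*bad = 0;
	while (node != NULL) {
		if ((uintptr_t)node % sizeof(void *) != 0) {
			*bad = (uintptr_t)node;	/* the kernel would fault here */
			break;
		}
		count++;
		node = node->next;
	}
	return count;
}

/* Build a two-node chain whose tail points at address 1, as in the oops. */
uintptr_t demo_bad_next(void)
{
	struct mock_dentry a = {0}, b = {0};
	uintptr_t bad;

	a.d_hash.next = &b.d_hash;
	b.d_hash.next = (struct hnode *)(uintptr_t)1;
	assert(dentry_of(&b.d_hash) == &b);
	walk_chain(&a.d_hash, &bad);
	return bad;
}
```

Calling `demo_bad_next()` from a small `main()` reports the bad pointer value rather than crashing, which is the only point of the sketch.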
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 23:17 ` Jiri Slaby @ 2008-04-22 0:54 ` Rafael J. Wysocki 2008-04-22 1:14 ` Linus Torvalds 0 siblings, 1 reply; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-22 0:54 UTC (permalink / raw) To: Jiri Slaby Cc: paulmck, David Miller, torvalds, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On Tuesday, 22 of April 2008, Jiri Slaby wrote: > On 04/22/2008 01:02 AM, Jiri Slaby wrote: > > On 04/22/2008 12:54 AM, Paul E. McKenney wrote: > >> On Tue, Apr 22, 2008 at 12:26:04AM +0200, Jiri Slaby wrote: > >>>> Having slub_debug enabled, tomorrow will be results, I guess... > > OK, methinks it's tomorrow yet, at least here. > > >>> Sorry, one more entry: > >>> > >>> 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) > > > > Zdenek's is at offset 184. > > > >>> 00f0000000000000 dentry.d_hash.next (me, offset 24) > >>> ffff81f02003f16c dentry.d_name.name (me, offset 56) > >>> memory ORed by 000000f000000000 > >>> fffff0002004c1b0 file.f_mapping (me, offset 176) > >>> memory hole, it was something like > >>> (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? > >>> ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) > >>> -1, ~0ULL > > The same place, dentry.d_hash.next is 1. No slub debug clues... I think, I'll > give slab a try. Any other clues? Well, SLUB uses some per-CPU data structures. Is it possible that they get corrupted and that this leads to the observed symptoms? ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 0:54 ` Rafael J. Wysocki @ 2008-04-22 1:14 ` Linus Torvalds 2008-04-22 1:30 ` Rafael J. Wysocki 2008-04-22 9:49 ` Jiri Slaby 0 siblings, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-22 1:14 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Jiri Slaby, paulmck, David Miller, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On Tue, 22 Apr 2008, Rafael J. Wysocki wrote: > > > > The same place, dentry.d_hash.next is 1. No slub debug clues... I think, I'll > > give slab a try. Any other clues? > > Well, SLUB uses some per CPU data structures. Is it possible that they get > corrupted and which leads to the observed symptoms? It really doesn't look like the slub allocations themselves would be corrupted. It very much looks like wild pointers corrupting allocations that themselves were fine. The nybble pattern looked intriguing (especially as it apparently also hit a normal page cache page!) but obviously not everything matches that pattern (eg your value of 1). What do you do to trigger this? Any particular load? Is it still just doing suspend/resume, or do you have something else that you are playing with? Also, have you tried CONFIG_DEBUG_PAGEALLOC? That can also be a very powerful way to find memory corruption. Does anybody see any other patterns? Looking at the modules linked in in the oopses from Zdenek, Rafael and Jiri, I don't see anything odd. You both all have 80211 support, maybe the corruption comes from the wireless layer? Or maybe it's the x86 code changes themselves, and it really is about the suspend/resume sequence itself. Are all the people who see this doing suspends? Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
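A small userspace sketch of how the "nybble pattern" mentioned above can be checked: XOR the value a field was expected to hold with the value actually found, then count how many hex digits were touched. The observed values and the f_mapping guess come from the table quoted earlier in the thread. The "expected" d_name.name value ffff81002003f16c is an assumption derived from the remark "memory ORed by 000000f000000000". The function names `corruption_mask`, `nybbles_touched` and `describe` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bits that differ between the expected and the observed value. */
uint64_t corruption_mask(uint64_t expected, uint64_t observed)
{
	return expected ^ observed;
}

/* Number of hex digits (nybbles) that contain at least one flipped bit. */
int nybbles_touched(uint64_t mask)
{
	int hits = 0;

	for (int shift = 0; shift < 64; shift += 4)
		if ((mask >> shift) & 0xf)
			hits++;
	return hits;
}

/* Format one line of a report into buf; returns what snprintf() returns. */
int describe(char *buf, size_t len, uint64_t expected, uint64_t observed)
{
	uint64_t mask = corruption_mask(expected, observed);

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%016llx -> %016llx  xor %016llx  nybbles %d",
			(unsigned long long)expected,
			(unsigned long long)observed,
			(unsigned long long)mask,
			nybbles_touched(mask));
}
```

Under these assumptions the d_name.name case differs in a single nybble, while the f_mapping guess differs in two adjacent ones. Values such as 1 or ~0ULL do not fit either shape, which is consistent with the caveat above that not everything matches the pattern.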
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 1:14 ` Linus Torvalds @ 2008-04-22 1:30 ` Rafael J. Wysocki 2008-04-22 9:49 ` Jiri Slaby 1 sibling, 0 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-22 1:30 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, paulmck, David Miller, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On Tuesday, 22 of April 2008, Linus Torvalds wrote: > > On Tue, 22 Apr 2008, Rafael J. Wysocki wrote: > > > > > > The same place, dentry.d_hash.next is 1. No slub debug clues... I think, I'll > > > give slab a try. Any other clues? > > > > Well, SLUB uses some per CPU data structures. Is it possible that they get > > corrupted and which leads to the observed symptoms? > > It really doesn't look like the slub allocations themselves would be > corrupted. It very much looks like wild pointers corrupting allocations > that themselves were fine. > > The nybble pattern looked intriguing (especially as it apparently also hit > a normal page cache page!) but obviously not everything matches that > pattern (eg your value of 1). > > What do you do to trigger this? Any particular load? Is it still just > doing suspend/resume, or do you have something else that you are playing > with? I've seen that only once, so far. Jiri seems to be able to trigger it more often. > Also, have you tried CONFIG_DEBUG_PAGEALLOC? That can also be a very > powerful way to find memory corruption. I always have CONFIG_DEBUG_PAGEALLOC set. > Does anybody see any other patterns? Looking at the modules linked in in > the oopses from Zdenek, Rafael and Jiri, I don't see anything odd. You > both all have 80211 support, maybe the corruption comes from the wireless > layer? Well, I thought about that too. However, I had a hang before 2.6.25-git2 that I suspect was related (I couldn't get any information from the box, as it just hung solid), so I'd rather suspect some x86 changes. 
> Or maybe it's the x86 code changes themselves, and it really is about the > suspend/resume sequence itself. It seems to be specific to x86-64, AFAICS. > Are all the people who see this doing suspends? I'm not sure. Thanks, Rafael ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 1:14 ` Linus Torvalds 2008-04-22 1:30 ` Rafael J. Wysocki @ 2008-04-22 9:49 ` Jiri Slaby 2008-04-22 9:53 ` Ingo Molnar 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-22 9:49 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, paulmck, David Miller, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac, mingo Linus Torvalds wrote: > > On Tue, 22 Apr 2008, Rafael J. Wysocki wrote: >>> The same place, dentry.d_hash.next is 1. No slub debug clues... I think, I'll >>> give slab a try. Any other clues? >> Well, SLUB uses some per CPU data structures. Is it possible that they get >> corrupted and which leads to the observed symptoms? > > It really doesn't look like the slub allocations themselves would be > corrupted. It very much looks like wild pointers corrupting allocations > that themselves were fine. Hmm, correct. > What do you do to trigger this? Any particular load? Is it still just > doing suspend/resume, or do you have something else that you are playing > with? Yesterday I did 2 suspend/resumes after 1 hour of uptime and ran git-status for a fraction of a second until it was killed. So I can perfectly reproduce it when I suspend, resume and produce some I/O load. I guess it's time to bisect 2.6.25-rc8-mm2 as I'm able to reproduce it the best and haven't seen that bug in -rc8-mm1 for over a week of suspending and working. > Also, have you tried CONFIG_DEBUG_PAGEALLOC? That can also be a very > powerful way to find memory corruption. Not yet. > Does anybody see any other patterns? Looking at the modules linked in in > the oopses from Zdenek, Rafael and Jiri, I don't see anything odd. You > both all have 80211 support, maybe the corruption comes from the wireless > layer? Maybe; however, I don't use that stack. It's a desktop machine, and the wireless device is only sitting there not turned on, but sure, the module is loaded. 
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 9:49 ` Jiri Slaby @ 2008-04-22 9:53 ` Ingo Molnar 2008-04-22 18:35 ` Zdenek Kabelac 2008-04-22 19:09 ` Ingo Molnar 0 siblings, 2 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-22 9:53 UTC (permalink / raw) To: Jiri Slaby Cc: Linus Torvalds, Rafael J. Wysocki, paulmck, David Miller, linux-kernel, akpm, linux-ext4, herbert, Zdenek Kabelac * Jiri Slaby <jirislaby@gmail.com> wrote: >> What do you do to trigger this? Any particular load? Is it still just >> doing suspend/resume, or do you have something else that you are >> playing with? > > Yesterday I did 2 suspend/resumes after 1 hour of uptime and ran > git-status for a fraction of a second until it was killed. So I can > perfectly reproduce it when I suspend, resume and produce some io > load. I guess it's time to bisect 2.6.25-rc8-mm2 as I'm able to > reproduce it the best and haven't seen that bug in -rc8-mm1 for over > week of suspending and working. the most dangerous x86 change we added was the PAT stuff. Does it influence the crashes in any way if you boot with 'nopat' or if you disable CONFIG_X86_PAT=y into the .config? the other area was the DMA ops change - that should be rather trivial on 64-bit though. Ingo ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 9:53 ` Ingo Molnar @ 2008-04-22 18:35 ` Zdenek Kabelac 2008-04-22 18:48 ` Linus Torvalds 2008-04-22 21:46 ` Rafael J. Wysocki 2008-04-22 19:09 ` Ingo Molnar 1 sibling, 2 replies; 183+ messages in thread From: Zdenek Kabelac @ 2008-04-22 18:35 UTC (permalink / raw) To: Ingo Molnar Cc: Jiri Slaby, Linus Torvalds, Rafael J. Wysocki, paulmck, David Miller, linux-kernel, akpm, linux-ext4, herbert 2008/4/22, Ingo Molnar <mingo@elte.hu>: > > * Jiri Slaby <jirislaby@gmail.com> wrote: > > >> What do you do to trigger this? Any particular load? Is it still just > >> doing suspend/resume, or do you have something else that you are > >> playing with? > > > > Yesterday I did 2 suspend/resumes after 1 hour of uptime and ran > > git-status for a fraction of a second until it was killed. So I can > > perfectly reproduce it when I suspend, resume and produce some io > > load. I guess it's time to bisect 2.6.25-rc8-mm2 as I'm able to > > reproduce it the best and haven't seen that bug in -rc8-mm1 for over > > week of suspending and working. > > > the most dangerous x86 change we added was the PAT stuff. Does it > influence the crashes in any way if you boot with 'nopat' or if you > disable CONFIG_X86_PAT=y into the .config? > > the other area was the DMA ops change - that should be rather trivial on > 64-bit though. I'm unsure how this is related to my original Oops post, but now that I have debug pagealloc enabled, this appeared in my log after resume. Should I open a new bug for this, or could this be part of the problem I've experienced later? 
(Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2 sd 0:0:0:0: [sda] Starting disk mmc0: new SD card at address 5a61 mmc mmc0:5a61: parent mmc0 is sleeping, will not add ------------[ cut here ]------------ WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0() Modules linked in: tda18271 nls_iso8859_2 nls_cp852 vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm snd_hda_intel snd_seq_oss snd_seq_midi_event snd_seq arc4 snd_seq_device snd_pcm_oss ecb crypto_blkcipher cryptomgr crypto_algapi iwl3945 snd_mixer_oss mac80211 snd_pcm mmc_block video sdhci thinkpad_acpi mmc_core i2c_i801 snd_timer rtc_cmos rtc_core backlight iTCO_wdt cfg80211 evdev snd i2c_core e1000e psmouse soundcore snd_page_alloc nvram intel_agp rtc_lib iTCO_vendor_support output serio_raw ac battery button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: microcode] Pid: 1240, comm: kmmcd Not tainted 2.6.25 #57 Call Trace: [warn_on_slowpath+95/144] warn_on_slowpath+0x5f/0x90 [device_pm_add+24/240] ? device_pm_add+0x18/0xf0 [device_pm_add+108/240] device_pm_add+0x6c/0xf0 [device_add+1092/1376] device_add+0x444/0x560 [_end+510110570/2109230024] :mmc_core:mmc_add_card+0xa2/0x140 [_end+510117927/2109230024] :mmc_core:mmc_attach_sd+0x17f/0x860 [_end+510109176/2109230024] ? :mmc_core:mmc_rescan+0x0/0x1c0 [_end+510109545/2109230024] :mmc_core:mmc_rescan+0x171/0x1c0 [run_workqueue+246/560] run_workqueue+0xf6/0x230 [worker_thread+167/288] worker_thread+0xa7/0x120 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [worker_thread+0/288] ? worker_thread+0x0/0x120 [kthread+73/144] kthread+0x49/0x90 [child_rip+10/18] child_rip+0xa/0x12 [restore_args+0/48] ? restore_args+0x0/0x30 [kthread+0/144] ? kthread+0x0/0x90 [child_rip+0/18] ? 
child_rip+0x0/0x12 ---[ end trace ca143223eefdc828 ]--- BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 IP: [klist_del+29/128] klist_del+0x1d/0x80 PGD 0 Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: tda18271 nls_iso8859_2 nls_cp852 vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm snd_hda_intel snd_seq_oss snd_seq_midi_event snd_seq arc4 snd_seq_device snd_pcm_oss ecb crypto_blkcipher cryptomgr crypto_algapi iwl3945 snd_mixer_oss mac80211 snd_pcm mmc_block video sdhci thinkpad_acpi mmc_core i2c_i801 snd_timer rtc_cmos rtc_core backlight iTCO_wdt cfg80211 evdev snd i2c_core e1000e psmouse soundcore snd_page_alloc nvram intel_agp rtc_lib iTCO_vendor_support output serio_raw ac battery button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: microcode] Pid: 1240, comm: kmmcd Not tainted 2.6.25 #57 RIP: 0010:[klist_del+29/128] [klist_del+29/128] klist_del+0x1d/0x80 RSP: 0000:ffff81007cabbd00 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003 RDX: 0000000000000008 RSI: ffffffffa0102308 RDI: 0000000000000000 RBP: ffff81007cabbd20 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: ffff81007c9a6d10 R12: ffff81007c517530 R13: ffffffffa0102260 R14: ffff81007cabbdf0 R15: ffff81007c5175a8 FS: 0000000000000000(0000) GS:ffffffff8148c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000050 CR3: 0000000001001000 CR4: 00000000000026e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kmmcd (pid: 1240, threadinfo ffff81007caba000, task ffff81007cac0000) Stack: ffff81007cabbd10 0000000000000050 ffff81007c5173f8 
ffffffffa0102260 ffff81007cabbd50 ffffffff812012fe ffff81007cabbd50 ffff81007c5173f8 00000000fffffff0 ffff81007c5175f0 ffff81007cabbdb0 ffffffff8120016e Call Trace: [bus_remove_device+158/208] bus_remove_device+0x9e/0xd0 [device_add+1358/1376] device_add+0x54e/0x560 [_end+510110570/2109230024] :mmc_core:mmc_add_card+0xa2/0x140 hald[2531]: forcibly attempting to lazy unmount /dev/mmcblk0p1 as enclosing drive was disconnected [_end+510117927/2109230024] :mmc_core:mmc_attach_sd+0x17f/0x860 [_end+510109176/2109230024] ? :mmc_core:mmc_rescan+0x0/0x1c0 [_end+510109545/2109230024] :mmc_core:mmc_rescan+0x171/0x1c0 [run_workqueue+246/560] run_workqueue+0xf6/0x230 [worker_thread+167/288] worker_thread+0xa7/0x120 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [worker_thread+0/288] ? worker_thread+0x0/0x120 [kthread+73/144] kthread+0x49/0x90 [child_rip+10/18] child_rip+0xa/0x12 [restore_args+0/48] ? restore_args+0x0/0x30 [kthread+0/144] ? kthread+0x0/0x90 [child_rip+0/18] ? child_rip+0x0/0x12 Code: 8b 28 41 0f 95 c7 eb 87 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 4c 89 65 f0 48 89 5d e8 4c 89 6d f8 49 89 fc 48 8b 1f 48 89 df <4c> 8b 6b 50 e8 9a 40 01 00 49 8d 7c 24 18 48 c7 c6 20 a4 2d 81 RIP [klist_del+29/128] klist_del+0x1d/0x80 RSP <ffff81007cabbd00> CR2: 0000000000000050 ---[ end trace ca143223eefdc828 ]--- ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 18:35 ` Zdenek Kabelac @ 2008-04-22 18:48 ` Linus Torvalds 2008-04-22 20:34 ` device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) Rafael J. Wysocki 2008-04-23 8:50 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Zdenek Kabelac 2008-04-22 21:46 ` Rafael J. Wysocki 1 sibling, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-22 18:48 UTC (permalink / raw) To: Zdenek Kabelac Cc: Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, linux-kernel, akpm, linux-ext4, herbert On Tue, 22 Apr 2008, Zdenek Kabelac wrote: > > Unsure how it is related to my orginal Oops post - but now when I've > debug pagealloc enabled this appeared in my log after resume - should > I open new bug for this - or could this be part of the problem I've > experienced later? > > (Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2 > > sd 0:0:0:0: [sda] Starting disk > mmc0: new SD card at address 5a61 > mmc mmc0:5a61: parent mmc0 is sleeping, will not add > ------------[ cut here ]------------ > WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0() This is unrelated to the other issue, I think. Your warning comes from commit 58aca23226a19983571bd3b65167521fc64f5869, which admittedly looks like total crap. Rafael, what's the point of that commit? I read the commit message, but I can't make myself agree with the commit code itself. If it's a "checking that the order is correct" thing, it should be a warning, but not change the actual _action_ of the code. 
Because the commit refused to add the device, it is also then the direct reason for the oops you get later, as far as I can tell: > BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 > IP: [klist_del+29/128] klist_del+0x1d/0x80 > PGD 0 > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC > CPU 0 > Call Trace: > [bus_remove_device+158/208] bus_remove_device+0x9e/0xd0 > [device_add+1358/1376] device_add+0x54e/0x560 So I would suggest reverting that commit, or at least just making it a warning (while still registering the device). Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-22 18:48 ` Linus Torvalds @ 2008-04-22 20:34 ` Rafael J. Wysocki 2008-04-22 20:57 ` Rafael J. Wysocki 2008-04-22 20:58 ` Linus Torvalds 2008-04-23 8:50 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Zdenek Kabelac 1 sibling, 2 replies; 183+ messages in thread
From: Rafael J. Wysocki @ 2008-04-22 20:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, linux-kernel, akpm, herbert, Alan Stern, pm list

On Tuesday, 22 of April 2008, Linus Torvalds wrote:
>
> On Tue, 22 Apr 2008, Zdenek Kabelac wrote:
> >
> > Unsure how it is related to my orginal Oops post - but now when I've
> > debug pagealloc enabled this appeared in my log after resume - should
> > I open new bug for this - or could this be part of the problem I've
> > experienced later?
> >
> > (Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2
> >
> > sd 0:0:0:0: [sda] Starting disk
> > mmc0: new SD card at address 5a61
> > mmc mmc0:5a61: parent mmc0 is sleeping, will not add
> > ------------[ cut here ]------------
> > WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0()
>
> This is unrelated to the other issue, I think.
>
> Your warning comes from commit 58aca23226a19983571bd3b65167521fc64f5869,
> which admittedly looks like total crap.

Well, I'm sorry that you think so.

> Rafael, what's the point of that commit?

More or less as stated in the changelog.  If we register a child of a sleeping
device, the child ends up on dpm_active before the parent, so the ordering will
be wrong during the next suspend.

That was discussed on linux-pm, mainly with Alan Stern.

> I read the commit message, but I can't make myself agree with the commit
> code itself. If it's a "checking that the order is correct" thing, it
> should be a warning, but not change the actual _action_ of the code.

That is easy to change.  Please find appended a patch for that.

> Because the commit refused to add the device, it is also then the direct
> reason for the oops you get later, as far as I can tell:
>
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> > IP: [klist_del+29/128] klist_del+0x1d/0x80
> > PGD 0
> > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> > CPU 0
> > Call Trace:
> > [bus_remove_device+158/208] bus_remove_device+0x9e/0xd0
> > [device_add+1358/1376] device_add+0x54e/0x560

There is a bug in device_add() that IMO can be fixed this way:

Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -820,11 +820,11 @@ int device_add(struct device *dev)
 	error = bus_add_device(dev);
 	if (error)
 		goto BusError;
+	bus_attach_device(dev);
 	error = device_pm_add(dev);
 	if (error)
 		goto PMError;
 	kobject_uevent(&dev->kobj, KOBJ_ADD);
-	bus_attach_device(dev);
 	if (parent)
 		klist_add_tail(&dev->knode_parent, &parent->klist_children);

The problem is that bus_remove_device() assumes bus_attach_device() to have
run, AFAICS.

> So I would suggest reverting that commit, or at least just making it a
> warning (while still registering the device).

Are drivers supposed to register children of suspended devices?  That doesn't
make much sense IMO ...

Thanks,
Rafael

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/base/power/main.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -76,12 +76,10 @@ int device_pm_add(struct device *dev)
 		else
 			dev_warn(dev, "devices are sleeping, will not add\n");
 		WARN_ON(true);
-		error = -EBUSY;
-	} else {
-		error = dpm_sysfs_add(dev);
-		if (!error)
-			list_add_tail(&dev->power.entry, &dpm_active);
 	}
+	error = dpm_sysfs_add(dev);
+	if (!error)
+		list_add_tail(&dev->power.entry, &dpm_active);
 	mutex_unlock(&dpm_list_mtx);
 	return error;
 }

^ permalink raw reply [flat|nested] 183+ messages in thread
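The behaviour change in the patch above (warn about a sleeping parent, but still add the device) can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the driver core: `struct mock_dev`, `mock_pm_add`, `demo_sleeping_parent` and the fixed-size `active` array are hypothetical stand-ins for struct device, device_pm_add() and the dpm_active list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_DEVS 8

/* Hypothetical stand-in for struct device; only what the sketch needs. */
struct mock_dev {
	const char *name;
	struct mock_dev *parent;
	bool sleeping;
};

struct mock_dev *active[MAX_DEVS];	/* plays the role of dpm_active */
int nr_active;
int nr_warnings;

/*
 * Model of the patched logic: a sleeping parent produces a diagnostic,
 * but the device is added to the list either way.
 */
int mock_pm_add(struct mock_dev *dev)
{
	assert(dev != NULL);
	if (dev->parent && dev->parent->sleeping) {
		fprintf(stderr, "%s: parent %s is sleeping\n",
			dev->name, dev->parent->name);
		nr_warnings++;
	}
	if (nr_active == MAX_DEVS)
		return -1;	/* genuine failure: no room left */
	active[nr_active++] = dev;
	return 0;
}

/* Register a card under a sleeping host; returns 1 if it was still added. */
int demo_sleeping_parent(void)
{
	static struct mock_dev host = { "mmc0", NULL, true };
	static struct mock_dev card = { "mmc0:5a61", &host, false };

	nr_active = 0;
	nr_warnings = 0;
	return mock_pm_add(&card) == 0 && nr_active == 1 && nr_warnings == 1;
}
```

The design point is the one raised in the thread: the diagnostic and the action are kept separate, so the check can be noisy without altering what the caller sees.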
* Re: device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-22 20:34 ` device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) Rafael J. Wysocki @ 2008-04-22 20:57 ` Rafael J. Wysocki 2008-04-22 22:11 ` Greg KH 2008-04-22 20:58 ` Linus Torvalds 1 sibling, 1 reply; 183+ messages in thread
From: Rafael J. Wysocki @ 2008-04-22 20:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, linux-kernel, akpm, herbert, Alan Stern, pm list, Greg KH

On Tuesday, 22 of April 2008, Rafael J. Wysocki wrote:
> On Tuesday, 22 of April 2008, Linus Torvalds wrote:
> >
> > On Tue, 22 Apr 2008, Zdenek Kabelac wrote:
> > >
> > > Unsure how it is related to my orginal Oops post - but now when I've
> > > debug pagealloc enabled this appeared in my log after resume - should
> > > I open new bug for this - or could this be part of the problem I've
> > > experienced later?
> > >
> > > (Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2
> > >
> > > sd 0:0:0:0: [sda] Starting disk
> > > mmc0: new SD card at address 5a61
> > > mmc mmc0:5a61: parent mmc0 is sleeping, will not add
> > > ------------[ cut here ]------------
> > > WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0()
> >
> > This is unrelated to the other issue, I think.
> >
> > Your warning comes from commit 58aca23226a19983571bd3b65167521fc64f5869,
> > which admittedly looks like total crap.
>
> Well, I'm sorry that you think so.
>
> > Rafael, what's the point of that commit?
>
> More or less as stated in the changelog.  If we register a child of a sleeping
> device, the child ends up on dpm_active before the parent, so the ordering will
> be wrong during the next suspend.
>
> That was discussed on linux-pm, mainly with Alan Stern.
>
> > I read the commit message, but I can't make myself agree with the commit
> > code itself. If it's a "checking that the order is correct" thing, it
> > should be a warning, but not change the actual _action_ of the code.
>
> That is easy to change.  Please find appended a patch for that.
>
> > Because the commit refused to add the device, it is also then the direct
> > reason for the oops you get later, as far as I can tell:
> >
> > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> > > IP: [klist_del+29/128] klist_del+0x1d/0x80
> > > PGD 0
> > > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> > > CPU 0
> > > Call Trace:
> > > [bus_remove_device+158/208] bus_remove_device+0x9e/0xd0
> > > [device_add+1358/1376] device_add+0x54e/0x560
>
> There is a bug in device_add() that IMO can be fixed this way:
>
> Index: linux-2.6/drivers/base/core.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/core.c
> +++ linux-2.6/drivers/base/core.c
> @@ -820,11 +820,11 @@ int device_add(struct device *dev)
>  	error = bus_add_device(dev);
>  	if (error)
>  		goto BusError;
> +	bus_attach_device(dev);
>  	error = device_pm_add(dev);
>  	if (error)
>  		goto PMError;
>  	kobject_uevent(&dev->kobj, KOBJ_ADD);
> -	bus_attach_device(dev);
>  	if (parent)
>  		klist_add_tail(&dev->knode_parent, &parent->klist_children);
>
> The problem is that bus_remove_device() assumes bus_attach_device() to have
> run, AFAICS.

Hm, actually it's better to do this instead IMHO:

---
Prevent bus_remove_device() from crashing if dev->knode_bus has not been
initialized before it's called.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/base/bus.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/base/bus.c
===================================================================
--- linux-2.6.orig/drivers/base/bus.c
+++ linux-2.6/drivers/base/bus.c
@@ -530,7 +530,8 @@ void bus_remove_device(struct device *de
 		sysfs_remove_link(&dev->bus->p->devices_kset->kobj,
 				  dev->bus_id);
 		device_remove_attrs(dev->bus, dev);
-		klist_del(&dev->knode_bus);
+		if (klist_node_attached(&dev->knode_bus))
+			klist_del(&dev->knode_bus);

 		pr_debug("bus: '%s': remove device %s\n",
 			 dev->bus->name, dev->bus_id);

^ permalink raw reply [flat|nested] 183+ messages in thread
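The guard added by the patch above can be illustrated with a tiny userspace list. This is a hedged sketch only: `struct mock_list`, `struct mock_node` and the `mock_*` and `demo_*` functions are hypothetical and merely imitate the shape of klist_node_attached() and klist_del(). In particular, the `owner` back-pointer is an assumption about why deleting a never-added node ends in a NULL dereference.

```c
#include <assert.h>
#include <stddef.h>

struct mock_node;

struct mock_list {
	struct mock_node *head;
	int len;
};

struct mock_node {
	struct mock_list *owner;	/* NULL until the node is added */
	struct mock_node *next;
};

/* Imitates klist_node_attached(): is the node on some list? */
int mock_node_attached(const struct mock_node *n)
{
	return n->owner != NULL;
}

void mock_list_add(struct mock_list *l, struct mock_node *n)
{
	n->owner = l;
	n->next = l->head;
	l->head = n;
	l->len++;
}

/* Unguarded delete: relies on n->owner being valid. */
void mock_list_del(struct mock_node *n)
{
	struct mock_list *l = n->owner;
	struct mock_node **pp;

	assert(l != NULL);	/* without the guard, this is where it blows up */
	for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			l->len--;
			break;
		}
	}
	n->owner = NULL;
	n->next = NULL;
}

/* Guarded removal, in the spirit of the patched bus_remove_device(). */
void mock_remove_device(struct mock_node *n)
{
	if (mock_node_attached(n))
		mock_list_del(n);
}

/* Remove one attached and one never-added node; returns 1 on success. */
int demo_guarded_remove(void)
{
	struct mock_list bus = { NULL, 0 };
	struct mock_node attached = { NULL, NULL };
	struct mock_node never_added = { NULL, NULL };

	mock_list_add(&bus, &attached);
	mock_remove_device(&never_added);	/* no-op thanks to the guard */
	mock_remove_device(&attached);
	return bus.len == 0 && !mock_node_attached(&attached);
}
```

With the guard in place, an error path may call the removal routine regardless of how far registration got, which is what the changelog above asks for.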
* Re: device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-22 20:57 ` Rafael J. Wysocki @ 2008-04-22 22:11 ` Greg KH 0 siblings, 0 replies; 183+ messages in thread From: Greg KH @ 2008-04-22 22:11 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linus Torvalds, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, linux-kernel, akpm, herbert, Alan Stern, pm list On Tue, Apr 22, 2008 at 10:57:50PM +0200, Rafael J. Wysocki wrote: > On Tuesday, 22 of April 2008, Rafael J. Wysocki wrote: > > On Tuesday, 22 of April 2008, Linus Torvalds wrote: > > > > > > On Tue, 22 Apr 2008, Zdenek Kabelac wrote: > > > > > > > > Unsure how it is related to my orginal Oops post - but now when I've > > > > debug pagealloc enabled this appeared in my log after resume - should > > > > I open new bug for this - or could this be part of the problem I've > > > > experienced later? > > > > > > > > (Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2 > > > > > > > > sd 0:0:0:0: [sda] Starting disk > > > > mmc0: new SD card at address 5a61 > > > > mmc mmc0:5a61: parent mmc0 is sleeping, will not add > > > > ------------[ cut here ]------------ > > > > WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0() > > > > > > This is unrelated to the other issue, I think. > > > > > > Your warning comes from commit 58aca23226a19983571bd3b65167521fc64f5869, > > > which admittedly looks like total crap. > > > > Well, I'm sorry that you think so. > > > > > Rafael, what's the point of that commit? > > > > More or less as stated in the changelog. If we register a child of a sleeping > > device, the child ends up on dpm_active before the parent, so the ordering will > > be wrong during the next suspend. > > > > That was discussed on linux-pm, mainly with Alan Stern. > > > > > I read the commit message, but I can't make myself agree with the commit > > > code itself. 
If it's a "checking that the order is correct" thing, it > > > should be a warning, but not change the actual _action_ of the code. > > > > That is easy to change. Please find appended a patch for that. > > > > > Because the commit refused to add the device, it is also then the direct > > > reason for the oops you get later, as far as I can tell: > > > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 > > > > IP: [klist_del+29/128] klist_del+0x1d/0x80 > > > > PGD 0 > > > > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC > > > > CPU 0 > > > > Call Trace: > > > > [bus_remove_device+158/208] bus_remove_device+0x9e/0xd0 > > > > [device_add+1358/1376] device_add+0x54e/0x560 > > > > There is a bug in device_add() that IMO can be fixed this way: > > > > Index: linux-2.6/drivers/base/core.c > > =================================================================== > > --- linux-2.6.orig/drivers/base/core.c > > +++ linux-2.6/drivers/base/core.c > > @@ -820,11 +820,11 @@ int device_add(struct device *dev) > > error = bus_add_device(dev); > > if (error) > > goto BusError; > > + bus_attach_device(dev); > > error = device_pm_add(dev); > > if (error) > > goto PMError; > > kobject_uevent(&dev->kobj, KOBJ_ADD); > > - bus_attach_device(dev); > > if (parent) > > klist_add_tail(&dev->knode_parent, &parent->klist_children); > > > > The problem is that bus_remove_device() assumes bus_attach_device() to have > > run, AFAICS. > > Hm, actually it's better to do this instead IMHO: > > --- > Prevent bus_remove_device() from crashing if dev->knode_bus has not been > initialized before it's called. > > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> This fix looks like the correct one to me, but then again, I'm jet lagged, and don't know all of the context here. Moving where we call bus_attach_device() does not seem correct, as you are changing the logic of when we export the kobject, which might not be a good idea. 
Acked-by: Greg Kroah-Hartman <gregkh@suse.de> > --- > drivers/base/bus.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > Index: linux-2.6/drivers/base/bus.c > =================================================================== > --- linux-2.6.orig/drivers/base/bus.c > +++ linux-2.6/drivers/base/bus.c > @@ -530,7 +530,8 @@ void bus_remove_device(struct device *de > sysfs_remove_link(&dev->bus->p->devices_kset->kobj, > dev->bus_id); > device_remove_attrs(dev->bus, dev); > - klist_del(&dev->knode_bus); > + if (klist_node_attached(&dev->knode_bus)) > + klist_del(&dev->knode_bus); > > pr_debug("bus: '%s': remove device %s\n", > dev->bus->name, dev->bus_id); ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-22 20:34 ` device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) Rafael J. Wysocki 2008-04-22 20:57 ` Rafael J. Wysocki @ 2008-04-22 20:58 ` Linus Torvalds 2008-04-22 22:12 ` Greg KH ` (2 more replies) 1 sibling, 3 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-22 20:58 UTC (permalink / raw) To: Rafael J. Wysocki, Greg KH Cc: Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, herbert, Alan Stern, pm list On Tue, 22 Apr 2008, Rafael J. Wysocki wrote: > > There is a bug in device_add() that IMO can be fixed this way: Ok, looks fine. Greg? > Index: linux-2.6/drivers/base/core.c > =================================================================== > --- linux-2.6.orig/drivers/base/core.c > +++ linux-2.6/drivers/base/core.c > @@ -820,11 +820,11 @@ int device_add(struct device *dev) > error = bus_add_device(dev); > if (error) > goto BusError; > + bus_attach_device(dev); > error = device_pm_add(dev); > if (error) > goto PMError; > kobject_uevent(&dev->kobj, KOBJ_ADD); > - bus_attach_device(dev); > if (parent) > klist_add_tail(&dev->knode_parent, &parent->klist_children); > > The problem is that bus_remove_device() assumes bus_attach_device() to have > run, AFAICS. As to the other issue: > > So I would suggest reverting that commit, or at least just making it a > > warning (while still registering the device). > > Are drivers supposed to register children of suspended devices? That doesn't > make much sense IMO ... Well, that's why I think the warning itself makes sense - and then we can decide whether it makes sense for that particular case or not. Clearly it happens (since it triggered), now we need to figure out _why_ it happened. But I don't think debugging messages should change behaviour. 
Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-22 20:58 ` Linus Torvalds @ 2008-04-22 22:12 ` Greg KH 2008-04-22 22:48 ` Rafael J. Wysocki 2008-04-23 0:50 ` Rafael J. Wysocki 2 siblings, 0 replies; 183+ messages in thread From: Greg KH @ 2008-04-22 22:12 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, herbert, Alan Stern, pm list On Tue, Apr 22, 2008 at 01:58:43PM -0700, Linus Torvalds wrote: > > > On Tue, 22 Apr 2008, Rafael J. Wysocki wrote: > > > > There is a bug in device_add() that IMO can be fixed this way: > > Ok, looks fine. Greg? No, the other patch from Rafael looks "more correct", I'd prefer that one to go in. I'm about to be away from email for probably 24 hours or so, so if you could commit it, I'd appreciate it. thanks, greg k-h ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-22 20:58 ` Linus Torvalds 2008-04-22 22:12 ` Greg KH @ 2008-04-22 22:48 ` Rafael J. Wysocki 2008-04-23 0:50 ` Rafael J. Wysocki 2 siblings, 0 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-22 22:48 UTC (permalink / raw) To: Linus Torvalds Cc: Greg KH, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, herbert, Alan Stern, pm list On Tuesday, 22 of April 2008, Linus Torvalds wrote: > > On Tue, 22 Apr 2008, Rafael J. Wysocki wrote: > > > > There is a bug in device_add() that IMO can be fixed this way: > > Ok, looks fine. Greg? > > > Index: linux-2.6/drivers/base/core.c > > =================================================================== > > --- linux-2.6.orig/drivers/base/core.c > > +++ linux-2.6/drivers/base/core.c > > @@ -820,11 +820,11 @@ int device_add(struct device *dev) > > error = bus_add_device(dev); > > if (error) > > goto BusError; > > + bus_attach_device(dev); > > error = device_pm_add(dev); > > if (error) > > goto PMError; > > kobject_uevent(&dev->kobj, KOBJ_ADD); > > - bus_attach_device(dev); > > if (parent) > > klist_add_tail(&dev->knode_parent, &parent->klist_children); > > > > The problem is that bus_remove_device() assumes bus_attach_device() to have > > run, AFAICS. > > As to the other issue: > > > > So I would suggest reverting that commit, or at least just making it a > > > warning (while still registering the device). > > > > Are drivers supposed to register children of suspended devices? That doesn't > > make much sense IMO ... > > Well, that's why I think the warning itself makes sense - and then we can > decide whether it makes sense for that particular case or not. Clearly it > happens (since it triggered), now we need to figure out _why_ it happened. 
Well, this particular case looks like a race to me, but I'm waiting for the whole dmesg from Zdenek to verify that. > But I don't think debugging messages should change behaviour. Okay, below is a more complete patch that changes device_pm_add() so that it doesn't refuse to add children of suspended devices. Note, however, that even if such a registration succeeds, it will probably lead to some future problems. Thanks, Rafael --- Do not refuse to actually register children of suspended devices, but still warn about attempts to do that. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> --- drivers/base/power/main.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) Index: linux-2.6/drivers/base/power/main.c =================================================================== --- linux-2.6.orig/drivers/base/power/main.c +++ linux-2.6/drivers/base/power/main.c @@ -62,7 +62,7 @@ static bool all_sleeping; */ int device_pm_add(struct device *dev) { - int error = 0; + int error; pr_debug("PM: Adding info for %s:%s\n", dev->bus ? dev->bus->name : "No Bus", @@ -70,18 +70,15 @@ int device_pm_add(struct device *dev) mutex_lock(&dpm_list_mtx); if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) { if (dev->parent->power.sleeping) - dev_warn(dev, - "parent %s is sleeping, will not add\n", + dev_warn(dev, "parent %s is sleeping\n", dev->parent->bus_id); else - dev_warn(dev, "devices are sleeping, will not add\n"); + dev_warn(dev, "all devices are sleeping\n"); WARN_ON(true); - error = -EBUSY; - } else { - error = dpm_sysfs_add(dev); - if (!error) - list_add_tail(&dev->power.entry, &dpm_active); } + error = dpm_sysfs_add(dev); + if (!error) + list_add_tail(&dev->power.entry, &dpm_active); mutex_unlock(&dpm_list_mtx); return error; } ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-22 20:58 ` Linus Torvalds 2008-04-22 22:12 ` Greg KH 2008-04-22 22:48 ` Rafael J. Wysocki @ 2008-04-23 0:50 ` Rafael J. Wysocki 2008-04-23 14:56 ` Alan Stern 2 siblings, 1 reply; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-23 0:50 UTC (permalink / raw) To: Linus Torvalds Cc: Greg KH, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, herbert, Alan Stern, pm list [Sorry for resending, but it looks like this message didn't reach the lists.] On Tuesday, 22 of April 2008, Linus Torvalds wrote: > > On Tue, 22 Apr 2008, Rafael J. Wysocki wrote: > > > > There is a bug in device_add() that IMO can be fixed this way: > > Ok, looks fine. Greg? > > > Index: linux-2.6/drivers/base/core.c > > =================================================================== > > --- linux-2.6.orig/drivers/base/core.c > > +++ linux-2.6/drivers/base/core.c > > @@ -820,11 +820,11 @@ int device_add(struct device *dev) > > error = bus_add_device(dev); > > if (error) > > goto BusError; > > + bus_attach_device(dev); > > error = device_pm_add(dev); > > if (error) > > goto PMError; > > kobject_uevent(&dev->kobj, KOBJ_ADD); > > - bus_attach_device(dev); > > if (parent) > > klist_add_tail(&dev->knode_parent, &parent->klist_children); > > > > The problem is that bus_remove_device() assumes bus_attach_device() to have > > run, AFAICS. > > As to the other issue: > > > > So I would suggest reverting that commit, or at least just making it a > > > warning (while still registering the device). > > > > Are drivers supposed to register children of suspended devices? That doesn't > > make much sense IMO ... > > Well, that's why I think the warning itself makes sense - and then we can > decide whether it makes sense for that particular case or not. 
Clearly it > happens (since it triggered), now we need to figure out _why_ it happened. Well, this particular case looks like a race to me. > But I don't think debugging messages should change behaviour. Okay, below is a more complete patch that changes device_pm_add() so that it doesn't refuse to add children of suspended devices. Note, however, that even if such a registration succeeds, it will probably lead to some future problems. Thanks, Rafael --- Do not refuse to actually register children of suspended devices, but still warn about attempts to do that. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> --- drivers/base/power/main.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) Index: linux-2.6/drivers/base/power/main.c =================================================================== --- linux-2.6.orig/drivers/base/power/main.c +++ linux-2.6/drivers/base/power/main.c @@ -62,7 +62,7 @@ static bool all_sleeping; */ int device_pm_add(struct device *dev) { - int error = 0; + int error; pr_debug("PM: Adding info for %s:%s\n", dev->bus ? dev->bus->name : "No Bus", @@ -70,18 +70,15 @@ int device_pm_add(struct device *dev) mutex_lock(&dpm_list_mtx); if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) { if (dev->parent->power.sleeping) - dev_warn(dev, - "parent %s is sleeping, will not add\n", + dev_warn(dev, "parent %s is sleeping\n", dev->parent->bus_id); else - dev_warn(dev, "devices are sleeping, will not add\n"); + dev_warn(dev, "all devices are sleeping\n"); WARN_ON(true); - error = -EBUSY; - } else { - error = dpm_sysfs_add(dev); - if (!error) - list_add_tail(&dev->power.entry, &dpm_active); } + error = dpm_sysfs_add(dev); + if (!error) + list_add_tail(&dev->power.entry, &dpm_active); mutex_unlock(&dpm_list_mtx); return error; } ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) 2008-04-23 0:50 ` Rafael J. Wysocki @ 2008-04-23 14:56 ` Alan Stern 0 siblings, 0 replies; 183+ messages in thread From: Alan Stern @ 2008-04-23 14:56 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linus Torvalds, Greg KH, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, herbert, pm list On Wed, 23 Apr 2008, Rafael J. Wysocki wrote: > > > Are drivers supposed to register children of suspended devices? That doesn't > > > make much sense IMO ... > > > > Well, that's why I think the warning itself makes sense - and then we can > > decide whether it makes sense for that particular case or not. Clearly it > > happens (since it triggered), now we need to figure out _why_ it happened. > > Well, this particular case looks like a race to me. I think the reason it happened is clear enough. Call it a race if you want, but the window is so large that it hardly qualifies. It's a result of the way the MMC core is written. There's an upper-level controller device, and below that is a host device, and below that is the card itself. The code that adds and removes children of the host device runs as part of the controller driver. Hence the problem: The driver adds children below the _host_ as soon as the _controller_ is resumed, even though the host is still suspended. It's not as big an error as it sounds -- the host was originally a class_device and then got converted over to a regular device. It doesn't have a driver of its own. This is one of the things that needs to be fixed up as part of the reworking of the system-sleep API. I simply haven't had any time to work on it (and I'm not likely to in the near future). 
You ought to be able to provoke this more or less at will, depending on the order in which your PCI devices are probed, by inserting or removing an MMC card during the disk spin-up interval while the system is waking up. Alan Stern ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 18:48 ` Linus Torvalds 2008-04-22 20:34 ` device_pm_add (was: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff) Rafael J. Wysocki @ 2008-04-23 8:50 ` Zdenek Kabelac 2008-04-23 15:53 ` Linus Torvalds 1 sibling, 1 reply; 183+ messages in thread From: Zdenek Kabelac @ 2008-04-23 8:50 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, linux-kernel, akpm, linux-ext4, herbert 2008/4/22, Linus Torvalds <torvalds@linux-foundation.org>: > > > On Tue, 22 Apr 2008, Zdenek Kabelac wrote: > > > > Unsure how it is related to my original Oops post - but now when I've > > debug pagealloc enabled this appeared in my log after resume - should > > I open new bug for this - or could this be part of the problem I've > > experienced later? > > > > (Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2 > > > > sd 0:0:0:0: [sda] Starting disk > > mmc0: new SD card at address 5a61 > > mmc mmc0:5a61: parent mmc0 is sleeping, will not add > > ------------[ cut here ]------------ > > WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0() > > > This is unrelated to the other issue, I think. > Hi, This time I've got a slightly larger mess with some other oopses - I'm not sure whether they are just a consequence of the bad PM commit or a separate issue. Is there a patch I should test among those posted to the list? Here is the oops log: (SPIN LOCK already disabled is my personal trace oops, which just checks whether spin_lock_irq is called with IRQs already disabled - the irqsave version should probably be used in this place instead, otherwise the IRQ state is not properly restored) PM: Syncing filesystems ... done. Freezing user space processes ... (elapsed 0.46 seconds) done. Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done. 
Suspending console(s) drm_sysfs_suspend ACPI: PCI interrupt for device 0000:00:02.0 disabled sd 0:0:0:0: [sda] Synchronizing SCSI cache sd 0:0:0:0: [sda] Stopping disk mmc0: card 5a61 removed MMC: killing requests for dead queue ACPI: PCI interrupt for device 0000:15:00.2 disabled ACPI: PCI interrupt for device 0000:00:1f.1 disabled ACPI: PCI interrupt for device 0000:00:1d.7 disabled ACPI: PCI interrupt for device 0000:00:1d.2 disabled ACPI: PCI interrupt for device 0000:00:1d.1 disabled ACPI: PCI interrupt for device 0000:00:1d.0 disabled ACPI: PCI interrupt for device 0000:00:1b.0 disabled ACPI: PCI interrupt for device 0000:00:1a.7 disabled ACPI: PCI interrupt for device 0000:00:1a.1 disabled ACPI: PCI interrupt for device 0000:00:1a.0 disabled ACPI: PCI interrupt for device 0000:00:19.0 disabled ACPI: Preparing to enter system sleep state S3 Disabling non-boot CPUs ... kvm: disabling virtualization on CPU1 CPU 1 is now offline lockdep: fixing up alternatives. SMP alternatives: switching to UP code CPU1 is down Extended CMOS year: 2000 hwsleep-0322 [00] enter_sleep_state : Entering sleep state [S3] x86: PAT support disabled. Extended CMOS year: 2000 Enabling non-boot CPUs ... lockdep: fixing up alternatives. SMP alternatives: switching to SMP code Booting processor 1/1 ip 6000 Initializing CPU#1 Calibrating delay using timer specific routine.. 4390.79 BogoMIPS (lpj=7314872) CPU: L1 I cache: 32K, L1 D cache: 32K CPU: L2 cache: 4096K CPU: Physical Processor ID: 0 CPU: Processor Core ID: 1 x86: PAT support disabled. SPIN IRQ ALREADY DISABLED Pid: 0, comm: swapper Not tainted 2.6.25 #57 Call Trace: [_spin_lock_irq+126/128] _spin_lock_irq+0x7e/0x80 [lock_ipi_call_lock+16/32] lock_ipi_call_lock+0x10/0x20 CPU1: Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz [start_secondary+68/206] start_secondary+0x44/0xce stepping 0a kvm: enabling virtualization on CPU1 CPU1 is up ACPI: EC: missing OBF confirmation, don't expect it any longer. 
ACPI: EC: missing write data confirmation, don't expect it any longer. ACPI: \_SB_.GDCK - docking ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 20 (level, low) -> IRQ 20 ACPI: PCI Interrupt 0000:00:1a.0[A] -> GSI 20 (level, low) -> IRQ 20 usb usb3: root hub lost power or was reset ACPI: PCI Interrupt 0000:00:1a.1[B] -> GSI 21 (level, low) -> IRQ 21 usb usb4: root hub lost power or was reset ACPI: PCI Interrupt 0000:00:1a.7[C] -> GSI 22 (level, low) -> IRQ 22 ACPI: PCI Interrupt 0000:00:1b.0[B] -> GSI 17 (level, low) -> IRQ 17 ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 16 usb usb5: root hub lost power or was reset ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 17 (level, low) -> IRQ 17 usb usb6: root hub lost power or was reset ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 18 usb usb7: root hub lost power or was reset ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 19 ACPI: PCI Interrupt 0000:00:1f.1[C] -> GSI 16 (level, low) -> IRQ 16 ata4.00: ACPI cmd ef/03:42:00:00:00:a0 filtered out ata4.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out ata4.00: configured for UDMA/33 ACPI: PCI Interrupt 0000:15:00.2[C] -> GSI 18 (level, low) -> IRQ 18 sd 0:0:0:0: [sda] Starting disk mmc0: new SD card at address 5a61 mmc mmc0:5a61: parent mmc0 is sleeping, will not add ------------[ cut here ]------------ WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0() Modules linked in: nls_iso8859_2 nls_cp852 vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm snd_hda_intel snd_seq_oss arc4 snd_seq_midi_event ecb snd_seq crypto_blkcipher cryptomgr snd_seq_device crypto_algapi snd_pcm_oss iwl3945 snd_mixer_oss snd_pcm mac80211 video thinkpad_acpi psmouse snd_timer backlight i2c_i801 rtc_cmos snd 
rtc_core iTCO_wdt evdev i2c_core cfg80211 soundcore nvram snd_page_alloc e1000e output mmc_block serio_raw rtc_lib iTCO_vendor_support sdhci mmc_core ac battery intel_agp button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: microcode] Pid: 1090, comm: kmmcd Not tainted 2.6.25 #57 Call Trace: [warn_on_slowpath+95/144] warn_on_slowpath+0x5f/0x90 [device_pm_add+24/240] ? device_pm_add+0x18/0xf0 [device_pm_add+108/240] device_pm_add+0x6c/0xf0 [device_add+1092/1376] device_add+0x444/0x560 [_end+509508458/2109230024] :mmc_core:mmc_add_card+0xa2/0x140 [_end+509515815/2109230024] :mmc_core:mmc_attach_sd+0x17f/0x860 [_end+509507064/2109230024] ? :mmc_core:mmc_rescan+0x0/0x1c0 [_end+509507433/2109230024] :mmc_core:mmc_rescan+0x171/0x1c0 [run_workqueue+246/560] run_workqueue+0xf6/0x230 [worker_thread+167/288] worker_thread+0xa7/0x120 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [worker_thread+0/288] ? worker_thread+0x0/0x120 [kthread+73/144] kthread+0x49/0x90 [child_rip+10/18] child_rip+0xa/0x12 [restore_args+0/48] ? restore_args+0x0/0x30 [kthread+0/144] ? kthread+0x0/0x90 [child_rip+0/18] ? 
child_rip+0x0/0x12 ---[ end trace ca143223eefdc828 ]--- BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 IP: [klist_del+29/128] klist_del+0x1d/0x80 PGD 0 Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: nls_iso8859_2 nls_cp852 vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm snd_hda_intel snd_seq_oss arc4 snd_seq_midi_event ecb snd_seq crypto_blkcipher cryptomgr snd_seq_device crypto_algapi snd_pcm_oss iwl3945 snd_mixer_oss snd_pcm mac80211 video thinkpad_acpi psmouse snd_timer backlight i2c_i801 rtc_cmos snd rtc_core iTCO_wdt evdev i2c_core cfg80211 soundcore nvram snd_page_alloc e1000e output mmc_block serio_raw rtc_lib iTCO_vendor_support sdhci mmc_core ac battery intel_agp button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: microcode] Pid: 1090, comm: kmmcd Not tainted 2.6.25 #57 RIP: 0010:[klist_del+29/128] [klist_del+29/128] klist_del+0x1d/0x80 RSP: 0000:ffff81007c4f5d00 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003 RDX: 0000000000000008 RSI: ffffffffa006f308 RDI: 0000000000000000 RBP: ffff81007c4f5d20 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: ffff8100712cbc78 R12: ffff81007126aaa8 R13: ffffffffa006f260 R14: ffff81007c4f5df0 R15: ffff81007126ab20 FS: 0000000000000000(0000) GS:ffffffff8148c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000050 CR3: 0000000001001000 CR4: 00000000000026e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kmmcd (pid: 1090, threadinfo ffff81007c4f4000, task ffff81007c028000) Stack: ffff81007c4f5d10 0000000000000050 ffff81007126a970 ffffffffa006f260 
ffff81007c4f5d50 ffffffff812012fe ffff81007c4f5d50 ffff81007126a970 00000000fffffff0 ffff81007126ab68 ffff81007c4f5db0 ffffffff8120016e Call Trace: [bus_remove_device+158/208] bus_remove_device+0x9e/0xd0 [device_add+1358/1376] device_add+0x54e/0x560 [_end+509508458/2109230024] :mmc_core:mmc_add_card+0xa2/0x140 [_end+509515815/2109230024] :mmc_core:mmc_attach_sd+0x17f/0x860 [_end+509507064/2109230024] ? :mmc_core:mmc_rescan+0x0/0x1c0 [_end+509507433/2109230024] :mmc_core:mmc_rescan+0x171/0x1c0 [run_workqueue+246/560] run_workqueue+0xf6/0x230 [worker_thread+167/288] worker_thread+0xa7/0x120 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [worker_thread+0/288] ? worker_thread+0x0/0x120 [kthread+73/144] kthread+0x49/0x90 [child_rip+10/18] child_rip+0xa/0x12 [restore_args+0/48] ? restore_args+0x0/0x30 [kthread+0/144] ? kthread+0x0/0x90 [child_rip+0/18] ? child_rip+0x0/0x12 Code: 8b 28 41 0f 95 c7 eb 87 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 4c 89 65 f0 48 89 5d e8 4c 89 6d f8 49 89 fc 48 8b 1f 48 89 df <4c> 8b 6b 50 e8 9a 40 01 00 49 8d 7c 24 18 48 c7 c6 20 a4 2d 81 RIP [klist_del+29/128] klist_del+0x1d/0x80 RSP <ffff81007c4f5d00> CR2: 0000000000000050 ---[ end trace ca143223eefdc828 ]--- ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out ata1.00: configured for UDMA/100 ata1.00: configured for UDMA/100 ata1: EH complete sd 0:0:0:0: [sda] 195371568 512-byte hardware sectors (100030 MB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 0:0:0:0: [sda] 195371568 512-byte hardware sectors (100030 MB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 16 Restarting tasks ... 
<6>usb 3-2: USB disconnect, address 2 Apr 23 10:25:37 localhost hald[2469]: forcibly attempting to lazy unmount /dev/mmcblk0p1 as enclosing drive was disconnected Apr 23 10:25:37 localhost gnome-power-manager: (kabi) Probuzení počítače Apr 23 10:25:37 localhost kernel: [19631.081098] done. Apr 23 10:25:38 localhost hald: unmounted /dev/mmcblk0p1 from '/media/disk' on behalf of uid 0 input: Virtual ThinkFinger Keyboard as /devices/virtual/input/input17 usb 1-4: new high speed USB device using ehci_hcd and address 4 usb 1-4: configuration #1 chosen from 1 choice hub 1-4:1.0: USB hub found hub 1-4:1.0: 4 ports detected usb 1-4: New USB device found, idVendor=04b3, idProduct=4485 usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0 usb 3-2: new full speed USB device using uhci_hcd and address 3 Apr 23:25:38 localhost console-kit-daemon[2472]: WARNING: Couldn't read /proc/16639/environ: Failed to open file '/proc/16639/environ': No such file or directory usb 3-2: configuration #1 chosen from 1 choice usb 3-2: New USB device found, idVendor=0483, idProduct=2016 usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 usb 3-2: Product: Biometric Coprocessor usb 3-2: Manufacturer: STMicroelectronics ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 17 ============================================================================= BUG kmalloc-4096: Padding overwritten. 0x0000000000000000-0x00000000ffffffff ----------------------------------------------------------------------------- INFO: Slab 0xffffe20000c09c00 used=7 fp=0x0000000000000000 flags=0x2200000004083 Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 Call Trace: [slab_err+167/192] slab_err+0xa7/0xc0 [__free_pages_ok+420/1216] ? __free_pages_ok+0x1a4/0x4c0 [kernel_map_pages+168/368] ? kernel_map_pages+0xa8/0x170 [add_partial+33/112] ? 
add_partial+0x21/0x70 [slab_pad_check+287/368] slab_pad_check+0x11f/0x170 [check_slab+34/112] check_slab+0x22/0x70 [__slab_free+458/944] __slab_free+0x1ca/0x3b0 [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 [kfree+180/304] kfree+0xb4/0x130 [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 [skb_release_data+133/208] skb_release_data+0x85/0xd0 [skb_release_all+158/240] skb_release_all+0x9e/0xf0 [__kfree_skb+17/160] __kfree_skb+0x11/0xa0 [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 [kfree_skb+23/64] kfree_skb+0x17/0x40 [_end+510638598/2109230024] :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 [_end+510662510/2109230024] :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 [_end+510613961/2109230024] :iwl3945:__iwl3945_up+0x91/0x640 [_end+510616880/2109230024] :iwl3945:iwl3945_mac_start+0x568/0x790 [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 [_end+510327174/2109230024] :mac80211:ieee80211_open+0x13e/0x590 [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 [dev_open+121/176] dev_open+0x79/0xb0 [dev_change_flags+153/464] dev_change_flags+0x99/0x1d0 [do_setlink+524/928] do_setlink+0x20c/0x3a0 [_read_unlock+48/96] ? _read_unlock+0x30/0x60 [rtnl_setlink+269/336] rtnl_setlink+0x10d/0x150 [rtnetlink_rcv_msg+397/576] rtnetlink_rcv_msg+0x18d/0x240 [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 [netlink_rcv_skb+137/176] netlink_rcv_skb+0x89/0xb0 [rtnetlink_rcv+41/64] rtnetlink_rcv+0x29/0x40 [netlink_unicast+709/736] netlink_unicast+0x2c5/0x2e0 [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 [netlink_sendmsg+498/752] netlink_sendmsg+0x1f2/0x2f0 [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 [sock_sendmsg+295/320] sock_sendmsg+0x127/0x140 [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 [move_addr_to_kernel+87/96] ? 
move_addr_to_kernel+0x57/0x60 [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 [sys_sendmsg+393/800] sys_sendmsg+0x189/0x320 [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a [system_call_after_swapgs+123/128] system_call_after_swapgs+0x7b/0x80 Padding 0xffff8100201a0000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Padding 0xffff8100201a0010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Padding 0xffff8100201a0020: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk ........... a lots of these ....... Padding 0xffff8100201a7190: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Padding 0xffff8100201a71a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk¥ Padding 0xffff8100201a71b0: cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff ÌÌÌÌÌÌÌÌ......ÿÿ Padding 0xffff8100201a71c0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Íp..ÿÿÿÿ....s... Padding 0xffff8100201a71d0: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Õq&.ÿÿÿÿ Padding 0xffff8100201a71e0: 00 00 00 00 7c 05 00 00 97 54 58 00 01 00 00 00 ....|....TX..... Padding 0xffff8100201a71f0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ FIX kmalloc-4096: Restoring 0xffff8100201a0000-0xffff8100201a7e16=0x5a ============================================================================= BUG kmalloc-4096: Redzone overwritten ----------------------------------------------------------------------------- INFO: 0xffff8100201a2048-0xffff8100201a204f. 
First byte 0x5a instead of 0xcc INFO: Allocated in 0x5a5a5a5a5a5a5a5a age=11936128522583413382 cpu=1515870810 pid=1515870810 INFO: Freed in 0x5a5a5a5a5a5a5a5a age=11936128522583413382 cpu=1515870810 pid=1515870810 INFO: Slab 0xffffe20000c09c00 used=7 fp=0x0000000000000000 flags=0x2200000004083 INFO: Object 0xffff8100201a1048 @offset=4168 fp=0x5a5a5a5a5a5a5a5a Bytes b4 0xffff8100201a1038: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a1048: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a1058: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a1068: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a1078: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a1088: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a1098: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a10a8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xffff8100201a10b8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Redzone 0xffff8100201a2048: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Padding 0xffff8100201a2088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 Call Trace: [print_trailer+330/448] print_trailer+0x14a/0x1c0 [check_bytes_and_report+293/384] check_bytes_and_report+0x125/0x180 [check_object+102/624] check_object+0x66/0x270 [__slab_free+683/944] __slab_free+0x2ab/0x3b0 [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 [kfree+180/304] kfree+0xb4/0x130 [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 [skb_release_data+133/208] skb_release_data+0x85/0xd0 [skb_release_all+158/240] skb_release_all+0x9e/0xf0 [__kfree_skb+17/160] __kfree_skb+0x11/0xa0 [_end+510662350/2109230024] ? 
:iwl3945:iwl3945_hw_nic_init+0x306/0x940 [kfree_skb+23/64] kfree_skb+0x17/0x40 [_end+510638598/2109230024] :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 [_end+510662510/2109230024] :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 [_end+510613961/2109230024] :iwl3945:__iwl3945_up+0x91/0x640 [_end+510616880/2109230024] :iwl3945:iwl3945_mac_start+0x568/0x790 [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 [_end+510327174/2109230024] :mac80211:ieee80211_open+0x13e/0x590 [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 [dev_open+121/176] dev_open+0x79/0xb0 [dev_change_flags+153/464] dev_change_flags+0x99/0x1d0 [do_setlink+524/928] do_setlink+0x20c/0x3a0 [_read_unlock+48/96] ? _read_unlock+0x30/0x60 [rtnl_setlink+269/336] rtnl_setlink+0x10d/0x150 [rtnetlink_rcv_msg+397/576] rtnetlink_rcv_msg+0x18d/0x240 [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 [netlink_rcv_skb+137/176] netlink_rcv_skb+0x89/0xb0 [rtnetlink_rcv+41/64] rtnetlink_rcv+0x29/0x40 [netlink_unicast+709/736] netlink_unicast+0x2c5/0x2e0 [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 [netlink_sendmsg+498/752] netlink_sendmsg+0x1f2/0x2f0 [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 [sock_sendmsg+295/320] sock_sendmsg+0x127/0x140 [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 [sys_sendmsg+393/800] sys_sendmsg+0x189/0x320 [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 [trace_hardirqs_on_thunk+53/58] ? 
trace_hardirqs_on_thunk+0x35/0x3a [system_call_after_swapgs+123/128] system_call_after_swapgs+0x7b/0x80 FIX kmalloc-4096: Restoring 0xffff8100201a2048-0xffff8100201a204f=0xcc general protection fault: 0000 [2] PREEMPT SMP DEBUG_PAGEALLOC CPU 1 Modules linked in: nls_iso8859_2 nls_cp852 vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm snd_hda_intel snd_seq_oss arc4 snd_seq_midi_event ecb snd_seq crypto_blkcipher cryptomgr snd_seq_device crypto_algapi snd_pcm_oss iwl3945 snd_mixer_oss snd_pcm mac80211 video thinkpad_acpi psmouse snd_timer backlight i2c_i801 rtc_cmos snd rtc_core iTCO_wdt evdev i2c_core cfg80211 soundcore nvram snd_page_alloc e1000e output mmc_block serio_raw rtc_lib iTCO_vendor_support sdhci mmc_core ac battery intel_agp button uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: microcode] Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 RIP: 0010:[put_page+14/256] [put_page+14/256] put_page+0xe/0x100 RSP: 0018:ffff81007c3bb5f8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000 RDX: ffff8100201a5d28 RSI: 00000000201a516c RDI: 5a5a5a5a5a5a5a5a RBP: ffff81007c3bb618 R08: ffff81007d355bd0 R09: ffff81006a96b0d8 R10: ffffe200027f8820 R11: ffff81006a96b000 R12: ffff81006a96b3c0 R13: ffff81007d352ba0 R14: ffff81007d351f00 R15: ffff81007d355bd0 FS: 00007f59fb63e780(0000) GS:ffff81007e02e190(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003a6cf6ade0 CR3: 0000000073960000 CR4: 00000000000026a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process NetworkManager (pid: 2621, threadinfo ffff81007c3ba000, task ffff81007245c000) Stack: 0000000000000001 ffff81006a96b3c0 
ffff81007d352ba0 ffff81007d351f00 ffff81007c3bb638 ffffffff812671fb ffff81006a96b3c0 00000000000000b1 ffff81007c3bb658 ffffffff81267bee ffff81007d351f00 ffff81006a96b3c0 Call Trace: [skb_release_data+171/208] skb_release_data+0xab/0xd0 [skb_release_all+158/240] skb_release_all+0x9e/0xf0 [__kfree_skb+17/160] __kfree_skb+0x11/0xa0 [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 [kfree_skb+23/64] kfree_skb+0x17/0x40 [_end+510638598/2109230024] :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 [_end+510662510/2109230024] :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 [_end+510613961/2109230024] :iwl3945:__iwl3945_up+0x91/0x640 [_end+510616880/2109230024] :iwl3945:iwl3945_mac_start+0x568/0x790 [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 [_end+510327174/2109230024] :mac80211:ieee80211_open+0x13e/0x590 [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 [dev_open+121/176] dev_open+0x79/0xb0 [dev_change_flags+153/464] dev_change_flags+0x99/0x1d0 [do_setlink+524/928] do_setlink+0x20c/0x3a0 [_read_unlock+48/96] ? _read_unlock+0x30/0x60 [rtnl_setlink+269/336] rtnl_setlink+0x10d/0x150 [rtnetlink_rcv_msg+397/576] rtnetlink_rcv_msg+0x18d/0x240 [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 [netlink_rcv_skb+137/176] netlink_rcv_skb+0x89/0xb0 [rtnetlink_rcv+41/64] rtnetlink_rcv+0x29/0x40 [netlink_unicast+709/736] netlink_unicast+0x2c5/0x2e0 [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 [netlink_sendmsg+498/752] netlink_sendmsg+0x1f2/0x2f0 [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 [sock_sendmsg+295/320] sock_sendmsg+0x127/0x140 [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 [sys_sendmsg+393/800] sys_sendmsg+0x189/0x320 [sys_sendto+253/288] ? 
sys_sendto+0xfd/0x120 [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a [system_call_after_swapgs+123/128] system_call_after_swapgs+0x7b/0x80 Code: ff 41 54 9d eb e4 48 8b 47 10 0f 1f 00 e9 62 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb <48> 8b 07 f6 c4 40 75 26 8b 4f 08 85 c9 75 0b 0f 0b eb fe 0f 1f RIP [put_page+14/256] put_page+0xe/0x100 RSP <ffff81007c3bb5f8> ---[ end trace ca143223eefdc828 ]--- SPIN IRQ ALREADY DISABLED Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 Call Trace: [_spin_lock_irq+126/128] _spin_lock_irq+0x7e/0x80 [exit_signals+85/304] exit_signals+0x55/0x130 [do_exit+133/2192] do_exit+0x85/0x890 [rotate_reclaimable_page+211/240] ? rotate_reclaimable_page+0xd3/0xf0 [do_unblank_screen+29/368] ? do_unblank_screen+0x1d/0x170 [oops_end+136/144] oops_end+0x88/0x90 [die+94/144] die+0x5e/0x90 [do_general_protection+344/368] do_general_protection+0x158/0x170 [error_exit+0/169] error_exit+0x0/0xa9 [put_page+14/256] ? put_page+0xe/0x100 [skb_release_data+171/208] ? skb_release_data+0xab/0xd0 [skb_release_all+158/240] ? skb_release_all+0x9e/0xf0 [__kfree_skb+17/160] ? __kfree_skb+0x11/0xa0 [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 [kfree_skb+23/64] ? kfree_skb+0x17/0x40 [_end+510638598/2109230024] ? :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 [_end+510662510/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 [_end+510613961/2109230024] ? :iwl3945:__iwl3945_up+0x91/0x640 [_end+510616880/2109230024] ? :iwl3945:iwl3945_mac_start+0x568/0x790 [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 [_end+510327174/2109230024] ? :mac80211:ieee80211_open+0x13e/0x590 [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 [dev_open+121/176] ? dev_open+0x79/0xb0 [dev_change_flags+153/464] ? dev_change_flags+0x99/0x1d0 [do_setlink+524/928] ? do_setlink+0x20c/0x3a0 [_read_unlock+48/96] ? 
_read_unlock+0x30/0x60 [rtnl_setlink+269/336] ? rtnl_setlink+0x10d/0x150 [rtnetlink_rcv_msg+397/576] ? rtnetlink_rcv_msg+0x18d/0x240 [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 [netlink_rcv_skb+137/176] ? netlink_rcv_skb+0x89/0xb0 [rtnetlink_rcv+41/64] ? rtnetlink_rcv+0x29/0x40 [netlink_unicast+709/736] ? netlink_unicast+0x2c5/0x2e0 [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 [netlink_sendmsg+498/752] ? netlink_sendmsg+0x1f2/0x2f0 [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 [sys_sendmsg+393/800] ? sys_sendmsg+0x189/0x320 [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a [system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80 note: NetworkManager[2621] exited with preempt_count 1 BUG: sleeping function called from invalid context at kernel/rwsem.c:21 in_atomic():1, irqs_disabled():0 INFO: lockdep is turned off. Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 Call Trace: [__debug_show_held_locks+35/48] ? __debug_show_held_locks+0x23/0x30 [__might_sleep+209/256] __might_sleep+0xd1/0x100 [down_read+32/112] down_read+0x20/0x70 [futex_wake+60/304] futex_wake+0x3c/0x130 [sprintf+104/112] ? sprintf+0x68/0x70 [do_futex+159/3440] do_futex+0x9f/0xd70 [_spin_unlock_irqrestore+133/144] ? _spin_unlock_irqrestore+0x85/0x90 [release_console_sem+524/544] ? release_console_sem+0x20c/0x220 [vprintk+1008/1232] ? vprintk+0x3f0/0x4d0 [sys_futex+180/320] sys_futex+0xb4/0x140 [acct_collect+435/496] ? acct_collect+0x1b3/0x1f0 [acct_collect+435/496] ? 
acct_collect+0x1b3/0x1f0 [mm_release+142/160] mm_release+0x8e/0xa0 [exit_mm+29/304] exit_mm+0x1d/0x130 [do_exit+461/2192] do_exit+0x1cd/0x890 [rotate_reclaimable_page+211/240] ? rotate_reclaimable_page+0xd3/0xf0 [do_unblank_screen+29/368] ? do_unblank_screen+0x1d/0x170 [oops_end+136/144] oops_end+0x88/0x90 [die+94/144] die+0x5e/0x90 [do_general_protection+344/368] do_general_protection+0x158/0x170 [error_exit+0/169] error_exit+0x0/0xa9 [put_page+14/256] ? put_page+0xe/0x100 [skb_release_data+171/208] ? skb_release_data+0xab/0xd0 [skb_release_all+158/240] ? skb_release_all+0x9e/0xf0 [__kfree_skb+17/160] ? __kfree_skb+0x11/0xa0 [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 [kfree_skb+23/64] ? kfree_skb+0x17/0x40 [_end+510638598/2109230024] ? :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 [_end+510662510/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 [_end+510613961/2109230024] ? :iwl3945:__iwl3945_up+0x91/0x640 [_end+510616880/2109230024] ? :iwl3945:iwl3945_mac_start+0x568/0x790 [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 [_end+510327174/2109230024] ? :mac80211:ieee80211_open+0x13e/0x590 [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 [dev_open+121/176] ? dev_open+0x79/0xb0 [dev_change_flags+153/464] ? dev_change_flags+0x99/0x1d0 [do_setlink+524/928] ? do_setlink+0x20c/0x3a0 [_read_unlock+48/96] ? _read_unlock+0x30/0x60 [rtnl_setlink+269/336] ? rtnl_setlink+0x10d/0x150 [rtnetlink_rcv_msg+397/576] ? rtnetlink_rcv_msg+0x18d/0x240 [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 [netlink_rcv_skb+137/176] ? netlink_rcv_skb+0x89/0xb0 [rtnetlink_rcv+41/64] ? rtnetlink_rcv+0x29/0x40 [netlink_unicast+709/736] ? netlink_unicast+0x2c5/0x2e0 [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 [netlink_sendmsg+498/752] ? netlink_sendmsg+0x1f2/0x2f0 [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 [sock_recvmsg+313/336] ? 
sock_recvmsg+0x139/0x150 [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 [sys_sendmsg+393/800] ? sys_sendmsg+0x189/0x320 [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a [system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80 NetworkManager used greatest stack depth: 2928 bytes left eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: None ACPI: \_SB_.GDCK - undocking usb 1-4: USB disconnect, address 4 ACPI: \_SB_.GDCK - docking usb 1-4: new high speed USB device using ehci_hcd and address 5 usb 1-4: configuration #1 chosen from 1 choice hub 1-4:1.0: USB hub found hub 1-4:1.0: 4 ports detected usb 1-4: New USB device found, idVendor=04b3, idProduct=4485 usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0 SysRq : Emergency Sync Emergency Sync complete SysRq : Emergency Remount R/O ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 8:50 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Zdenek Kabelac @ 2008-04-23 15:53 ` Linus Torvalds 2008-04-23 16:58 ` Pekka Enberg ` (3 more replies) 0 siblings, 4 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-23 15:53 UTC (permalink / raw) To: Zdenek Kabelac Cc: Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter On Wed, 23 Apr 2008, Zdenek Kabelac wrote: > > This time I've got slightly larger mess with some other oopses - I'm > not sure if they are just a consequence of the PM bad commit - or they > are a separate issue ? Goodie, two of the backtraces (the parent-is-sleeping warning and the immediately subsequent oops) look like the same thing that should already be fixed in current -git. But there is some interesting stuff there.. > (SPIN LOCK already disabled is my personal trace ooops which is just > checking if the spin_lock_irq is already called with disabled irq - in > this place probably irqsave version should be used instead, otherwice > it's not properly restored) Yes, that's interesting to see. > Booting processor 1/1 ip 6000 > Initializing CPU#1 > Calibrating delay using timer specific routine.. 4390.79 BogoMIPS (lpj=7314872) > CPU: L1 I cache: 32K, L1 D cache: 32K > CPU: L2 cache: 4096K > CPU: Physical Processor ID: 0 > CPU: Processor Core ID: 1 > x86: PAT support disabled. 
> SPIN IRQ ALREADY DISABLED > Pid: 0, comm: swapper Not tainted 2.6.25 #57 > > Call Trace: > [_spin_lock_irq+126/128] _spin_lock_irq+0x7e/0x80 > [lock_ipi_call_lock+16/32] lock_ipi_call_lock+0x10/0x20 > CPU1: Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz > [start_secondary+68/206] start_secondary+0x44/0xce This is indeed an interesting issue: arch/x86/kernel/smpboot.c does an IPI call to start_secondary, and yes, it looks suspicious to have that lock_ipi_call_lock there (and in particular the unlock_ipi_call_lock that enables interrupts within it). Ingo? But the really interesting one is the later kmalloc() debugging triggers, because this one is, I suspect, very much a sign of the memory corruption bug you see. There's two reasons that make me say that: - the callback is in networking code and wireless, which was one of the possible suspects. - the padding pattern which *should* have been POISON_INUSE (0x5a) has been overwritten with: Padding 0xffff8100201a0000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk .... Padding 0xffff8100201a71a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk¥ Padding 0xffff8100201a71b0: cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff ÌÌÌÌÌÌÌÌ......ÿÿ Padding 0xffff8100201a71c0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Íp..ÿÿÿÿ....s... Padding 0xffff8100201a71d0: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Õq&.ÿÿÿÿ Padding 0xffff8100201a71e0: 00 00 00 00 7c 05 00 00 97 54 58 00 01 00 00 00 ....|....TX..... which in turn is interesting because it very much looks like SLUB re-used a page for something else (the values that things got overwritten by are largely SLUB's own poison bytes: 6b is POISON_FREE, the a5 at the end of the list of 6b's is POISON_END, while cc is SLUB_RED_ACTIVE). To me, that pattern looks like an order-3 allocation (correct: that's what kmalloc-4096 is supposed to be using!) 
got released, and the stuff at the end (with slub debugging, there's only room for seven 4096-byte allocations there, so 71b0 is past the end) is that SLUB debug info. The first word of that busy allocation is ffff8100201a0000, which is also the base pointer to the whole order-3 page ("Free pointer"), followed by the SLAB tracking data. Looks like possibly a double free to me (with the first free causing the page to be re-used, the second free is the one that triggers the debug message). But maybe Pekka or Christoph are better at reading those oopses. Now, the first slab debug trigger then does: FIX kmalloc-4096: Restoring 0xffff8100201a0000-0xffff8100201a7e16=0x5a to "restore" the data to its expected values, which is why the *second* one triggers, because now the allocation that was re-used got overwritten with that free pattern, and then you get more complaints about *that*, and the skb pointers themselves now have bogus data in them (overwritten twice: first with 0x5a, to restore the first one, then with 0xcc for the second warning). So then the subsequent "general protection fault" is just because of bogus skb pointers due to the still-in-use allocation being overwritten by all these poison values. And finally, the stuff at the very end (BUG: sleeping function called from invalid context and the SPIN IRQ one) are just warnings because we killed a process in a critical section, so all the preempt and irq flags are just wrong. Those can be ignored entirely. But what is interesting is that this does look networking-related. I suspect it's the suspend/resume that triggers something with the dev_open() thing, which re-uses an already-free'd pointer or whatever. I have no clue about exactly what goes wrong, but I really would suspect that whole "network device down/up" sequence during the suspend. I've left the kernel trace appended, since I added a few more people to the discussion.
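For anyone following the byte-pattern reading above, here is a minimal Python sketch (not kernel code) that labels the bytes of two of the "Padding" rows with the poison names mentioned in this message. The four values are the ones named above; the helper names (label_bytes, summarize, le64) are made up for this illustration and exist nowhere in the kernel.

```python
# Poison values exactly as named in the analysis above.
POISON = {
    0x5A: "POISON_INUSE",
    0x6B: "POISON_FREE",
    0xA5: "POISON_END",
    0xCC: "SLUB_RED_ACTIVE",
}


def label_bytes(hex_row):
    """Map each byte of a hex-dump row to its poison name, or 'data'."""
    return [POISON.get(int(tok, 16), "data") for tok in hex_row.split()]


def summarize(hex_row):
    """Collapse consecutive identical labels into (label, count) runs."""
    runs = []
    for label in label_bytes(hex_row):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + 1)
        else:
            runs.append((label, 1))
    return runs


def le64(hex_bytes):
    """Decode eight dumped bytes as a little-endian 64-bit word (x86-64 order)."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


# Rows copied from the dump at ...71a0 and ...71b0.
row_71a0 = "6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5"
row_71b0 = "cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff"

print(summarize(row_71a0))
print(summarize(row_71b0))
print(hex(le64("00 00 1a 20 00 81 ff ff")))
```

The first row comes out as a run of POISON_FREE closed by one POISON_END byte, the second as eight SLUB_RED_ACTIVE bytes followed by eight non-poison bytes, and those eight bytes decode to the ffff8100201a0000 value referred to above as the "Free pointer".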
Linus --- > ============================================================================= > BUG kmalloc-4096: Padding overwritten. 0x0000000000000000-0x00000000ffffffff > ----------------------------------------------------------------------------- > > INFO: Slab 0xffffe20000c09c00 used=7 fp=0x0000000000000000 flags=0x2200000004083 > Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 > > Call Trace: > [slab_err+167/192] slab_err+0xa7/0xc0 > [__free_pages_ok+420/1216] ? __free_pages_ok+0x1a4/0x4c0 > [kernel_map_pages+168/368] ? kernel_map_pages+0xa8/0x170 > [add_partial+33/112] ? add_partial+0x21/0x70 > [slab_pad_check+287/368] slab_pad_check+0x11f/0x170 > [check_slab+34/112] check_slab+0x22/0x70 > [__slab_free+458/944] __slab_free+0x1ca/0x3b0 > [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 > [kfree+180/304] kfree+0xb4/0x130 > [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 > [skb_release_data+133/208] skb_release_data+0x85/0xd0 > [skb_release_all+158/240] skb_release_all+0x9e/0xf0 > [__kfree_skb+17/160] __kfree_skb+0x11/0xa0 > [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 > [kfree_skb+23/64] kfree_skb+0x17/0x40 > [_end+510638598/2109230024] :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 > [_end+510662510/2109230024] :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 > [_end+510613961/2109230024] :iwl3945:__iwl3945_up+0x91/0x640 > [_end+510616880/2109230024] :iwl3945:iwl3945_mac_start+0x568/0x790 > [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 > [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 > [_end+510327174/2109230024] :mac80211:ieee80211_open+0x13e/0x590 > [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 > [dev_open+121/176] dev_open+0x79/0xb0 > [dev_change_flags+153/464] dev_change_flags+0x99/0x1d0 > [do_setlink+524/928] do_setlink+0x20c/0x3a0 > [_read_unlock+48/96] ? 
_read_unlock+0x30/0x60 > [rtnl_setlink+269/336] rtnl_setlink+0x10d/0x150 > [rtnetlink_rcv_msg+397/576] rtnetlink_rcv_msg+0x18d/0x240 > [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 > [netlink_rcv_skb+137/176] netlink_rcv_skb+0x89/0xb0 > [rtnetlink_rcv+41/64] rtnetlink_rcv+0x29/0x40 > [netlink_unicast+709/736] netlink_unicast+0x2c5/0x2e0 > [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 > [netlink_sendmsg+498/752] netlink_sendmsg+0x1f2/0x2f0 > [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 > [sock_sendmsg+295/320] sock_sendmsg+0x127/0x140 > [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 > [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 > [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 > [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 > [sys_sendmsg+393/800] sys_sendmsg+0x189/0x320 > [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 > [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a > [system_call_after_swapgs+123/128] system_call_after_swapgs+0x7b/0x80 > > Padding 0xffff8100201a0000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > Padding 0xffff8100201a0010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > Padding 0xffff8100201a0020: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > ........... a lots of these ....... > Padding 0xffff8100201a7190: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > Padding 0xffff8100201a71a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk¥ > Padding 0xffff8100201a71b0: cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff ÌÌÌÌÌÌÌÌ......ÿÿ > Padding 0xffff8100201a71c0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Íp..ÿÿÿÿ....s... > Padding 0xffff8100201a71d0: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Õq&.ÿÿÿÿ > Padding 0xffff8100201a71e0: 00 00 00 00 7c 05 00 00 97 54 58 00 01 00 00 00 ....|....TX..... 
> Padding 0xffff8100201a71f0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ > FIX kmalloc-4096: Restoring 0xffff8100201a0000-0xffff8100201a7e16=0x5a > > ============================================================================= > BUG kmalloc-4096: Redzone overwritten > ----------------------------------------------------------------------------- > > INFO: 0xffff8100201a2048-0xffff8100201a204f. First byte 0x5a instead of 0xcc > INFO: Allocated in 0x5a5a5a5a5a5a5a5a age=11936128522583413382 cpu=1515870810 pid=1515870810 > INFO: Freed in 0x5a5a5a5a5a5a5a5a age=11936128522583413382 cpu=1515870810 pid=1515870810 > INFO: Slab 0xffffe20000c09c00 used=7 fp=0x0000000000000000 flags=0x2200000004083 > INFO: Object 0xffff8100201a1048 @offset=4168 fp=0x5a5a5a5a5a5a5a5a > > Bytes b4 0xffff8100201a1038: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a1048: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a1058: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a1068: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a1078: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a1088: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a1098: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a10a8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Object 0xffff8100201a10b8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > Redzone 0xffff8100201a2048: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ > Padding 0xffff8100201a2088: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ > Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 > > Call Trace: > [print_trailer+330/448] print_trailer+0x14a/0x1c0 > [check_bytes_and_report+293/384] check_bytes_and_report+0x125/0x180 > [check_object+102/624] check_object+0x66/0x270 > 
[__slab_free+683/944] __slab_free+0x2ab/0x3b0 > [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 > [kfree+180/304] kfree+0xb4/0x130 > [skb_release_data+133/208] ? skb_release_data+0x85/0xd0 > [skb_release_data+133/208] skb_release_data+0x85/0xd0 > [skb_release_all+158/240] skb_release_all+0x9e/0xf0 > [__kfree_skb+17/160] __kfree_skb+0x11/0xa0 > [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 > [kfree_skb+23/64] kfree_skb+0x17/0x40 > [_end+510638598/2109230024] :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 > [_end+510662510/2109230024] :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 > [_end+510613961/2109230024] :iwl3945:__iwl3945_up+0x91/0x640 > [_end+510616880/2109230024] :iwl3945:iwl3945_mac_start+0x568/0x790 > [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 > [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 > [_end+510327174/2109230024] :mac80211:ieee80211_open+0x13e/0x590 > [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 > [dev_open+121/176] dev_open+0x79/0xb0 > [dev_change_flags+153/464] dev_change_flags+0x99/0x1d0 > [do_setlink+524/928] do_setlink+0x20c/0x3a0 > [_read_unlock+48/96] ? _read_unlock+0x30/0x60 > [rtnl_setlink+269/336] rtnl_setlink+0x10d/0x150 > [rtnetlink_rcv_msg+397/576] rtnetlink_rcv_msg+0x18d/0x240 > [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 > [netlink_rcv_skb+137/176] netlink_rcv_skb+0x89/0xb0 > [rtnetlink_rcv+41/64] rtnetlink_rcv+0x29/0x40 > [netlink_unicast+709/736] netlink_unicast+0x2c5/0x2e0 > [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 > [netlink_sendmsg+498/752] netlink_sendmsg+0x1f2/0x2f0 > [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 > [sock_sendmsg+295/320] sock_sendmsg+0x127/0x140 > [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 > [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 > [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 > [verify_iovec+60/208] ? 
verify_iovec+0x3c/0xd0 > [sys_sendmsg+393/800] sys_sendmsg+0x189/0x320 > [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 > [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a > [system_call_after_swapgs+123/128] system_call_after_swapgs+0x7b/0x80 > > FIX kmalloc-4096: Restoring 0xffff8100201a2048-0xffff8100201a204f=0xcc > > general protection fault: 0000 [2] PREEMPT SMP DEBUG_PAGEALLOC > CPU 1 > Modules linked in: nls_iso8859_2 nls_cp852 vfat fat i915 drm > ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state > nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables > bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc > binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm > snd_hda_intel snd_seq_oss arc4 snd_seq_midi_event ecb snd_seq > crypto_blkcipher cryptomgr snd_seq_device crypto_algapi snd_pcm_oss > iwl3945 snd_mixer_oss snd_pcm mac80211 video thinkpad_acpi psmouse > snd_timer backlight i2c_i801 rtc_cmos snd rtc_core iTCO_wdt evdev > i2c_core cfg80211 soundcore nvram snd_page_alloc e1000e output > mmc_block serio_raw rtc_lib iTCO_vendor_support sdhci mmc_core ac > battery intel_agp button uhci_hcd ohci_hcd ehci_hcd usbcore [last > unloaded: microcode] > Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 > RIP: 0010:[put_page+14/256] [put_page+14/256] put_page+0xe/0x100 > RSP: 0018:ffff81007c3bb5f8 EFLAGS: 00010046 > RAX: 0000000000000000 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000 > RDX: ffff8100201a5d28 RSI: 00000000201a516c RDI: 5a5a5a5a5a5a5a5a > RBP: ffff81007c3bb618 R08: ffff81007d355bd0 R09: ffff81006a96b0d8 > R10: ffffe200027f8820 R11: ffff81006a96b000 R12: ffff81006a96b3c0 > R13: ffff81007d352ba0 R14: ffff81007d351f00 R15: ffff81007d355bd0 > FS: 00007f59fb63e780(0000) GS:ffff81007e02e190(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 0000003a6cf6ade0 CR3: 0000000073960000 CR4: 00000000000026a0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process NetworkManager (pid: 2621, threadinfo ffff81007c3ba000, task > ffff81007245c000) > Stack: 0000000000000001 ffff81006a96b3c0 ffff81007d352ba0 ffff81007d351f00 > ffff81007c3bb638 ffffffff812671fb ffff81006a96b3c0 00000000000000b1 > ffff81007c3bb658 ffffffff81267bee ffff81007d351f00 ffff81006a96b3c0 > Call Trace: > [skb_release_data+171/208] skb_release_data+0xab/0xd0 > [skb_release_all+158/240] skb_release_all+0x9e/0xf0 > [__kfree_skb+17/160] __kfree_skb+0x11/0xa0 > [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 > [kfree_skb+23/64] kfree_skb+0x17/0x40 > [_end+510638598/2109230024] :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 > [_end+510662510/2109230024] :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 > [_end+510613961/2109230024] :iwl3945:__iwl3945_up+0x91/0x640 > [_end+510616880/2109230024] :iwl3945:iwl3945_mac_start+0x568/0x790 > [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 > [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 > [_end+510327174/2109230024] :mac80211:ieee80211_open+0x13e/0x590 > [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 > [dev_open+121/176] dev_open+0x79/0xb0 > [dev_change_flags+153/464] dev_change_flags+0x99/0x1d0 > [do_setlink+524/928] do_setlink+0x20c/0x3a0 > [_read_unlock+48/96] ? _read_unlock+0x30/0x60 > [rtnl_setlink+269/336] rtnl_setlink+0x10d/0x150 > [rtnetlink_rcv_msg+397/576] rtnetlink_rcv_msg+0x18d/0x240 > [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 > [netlink_rcv_skb+137/176] netlink_rcv_skb+0x89/0xb0 > [rtnetlink_rcv+41/64] rtnetlink_rcv+0x29/0x40 > [netlink_unicast+709/736] netlink_unicast+0x2c5/0x2e0 > [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 > [netlink_sendmsg+498/752] netlink_sendmsg+0x1f2/0x2f0 > [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 > [sock_sendmsg+295/320] sock_sendmsg+0x127/0x140 > [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 > [autoremove_wake_function+0/64] ? 
autoremove_wake_function+0x0/0x40 > [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 > [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 > [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 > [sys_sendmsg+393/800] sys_sendmsg+0x189/0x320 > [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 > [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a > [system_call_after_swapgs+123/128] system_call_after_swapgs+0x7b/0x80 > > > Code: ff 41 54 9d eb e4 48 8b 47 10 0f 1f 00 e9 62 ff ff ff 66 66 2e > 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb <48> > 8b 07 f6 c4 40 75 26 8b 4f 08 85 c9 75 0b 0f 0b eb fe 0f 1f > RIP [put_page+14/256] put_page+0xe/0x100 > RSP <ffff81007c3bb5f8> > ---[ end trace ca143223eefdc828 ]--- > SPIN IRQ ALREADY DISABLED > Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 > > Call Trace: > [_spin_lock_irq+126/128] _spin_lock_irq+0x7e/0x80 > [exit_signals+85/304] exit_signals+0x55/0x130 > [do_exit+133/2192] do_exit+0x85/0x890 > [rotate_reclaimable_page+211/240] ? rotate_reclaimable_page+0xd3/0xf0 > [do_unblank_screen+29/368] ? do_unblank_screen+0x1d/0x170 > [oops_end+136/144] oops_end+0x88/0x90 > [die+94/144] die+0x5e/0x90 > [do_general_protection+344/368] do_general_protection+0x158/0x170 > [error_exit+0/169] error_exit+0x0/0xa9 > [put_page+14/256] ? put_page+0xe/0x100 > [skb_release_data+171/208] ? skb_release_data+0xab/0xd0 > [skb_release_all+158/240] ? skb_release_all+0x9e/0xf0 > [__kfree_skb+17/160] ? __kfree_skb+0x11/0xa0 > [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 > [kfree_skb+23/64] ? kfree_skb+0x17/0x40 > [_end+510638598/2109230024] ? :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 > [_end+510662510/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 > [_end+510613961/2109230024] ? :iwl3945:__iwl3945_up+0x91/0x640 > [_end+510616880/2109230024] ? :iwl3945:iwl3945_mac_start+0x568/0x790 > [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 > [rb_insert_color+265/320] ? 
rb_insert_color+0x109/0x140 > [_end+510327174/2109230024] ? :mac80211:ieee80211_open+0x13e/0x590 > [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 > [dev_open+121/176] ? dev_open+0x79/0xb0 > [dev_change_flags+153/464] ? dev_change_flags+0x99/0x1d0 > [do_setlink+524/928] ? do_setlink+0x20c/0x3a0 > [_read_unlock+48/96] ? _read_unlock+0x30/0x60 > [rtnl_setlink+269/336] ? rtnl_setlink+0x10d/0x150 > [rtnetlink_rcv_msg+397/576] ? rtnetlink_rcv_msg+0x18d/0x240 > [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 > [netlink_rcv_skb+137/176] ? netlink_rcv_skb+0x89/0xb0 > [rtnetlink_rcv+41/64] ? rtnetlink_rcv+0x29/0x40 > [netlink_unicast+709/736] ? netlink_unicast+0x2c5/0x2e0 > [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 > [netlink_sendmsg+498/752] ? netlink_sendmsg+0x1f2/0x2f0 > [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 > [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 > [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 > [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 > [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 > [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 > [sys_sendmsg+393/800] ? sys_sendmsg+0x189/0x320 > [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 > [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a > [system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80 > > note: NetworkManager[2621] exited with preempt_count 1 > BUG: sleeping function called from invalid context at kernel/rwsem.c:21 > in_atomic():1, irqs_disabled():0 > INFO: lockdep is turned off. > Pid: 2621, comm: NetworkManager Tainted: G D 2.6.25 #57 > > Call Trace: > [__debug_show_held_locks+35/48] ? __debug_show_held_locks+0x23/0x30 > [__might_sleep+209/256] __might_sleep+0xd1/0x100 > [down_read+32/112] down_read+0x20/0x70 > [futex_wake+60/304] futex_wake+0x3c/0x130 > [sprintf+104/112] ? 
sprintf+0x68/0x70 > [do_futex+159/3440] do_futex+0x9f/0xd70 > [_spin_unlock_irqrestore+133/144] ? _spin_unlock_irqrestore+0x85/0x90 > [release_console_sem+524/544] ? release_console_sem+0x20c/0x220 > [vprintk+1008/1232] ? vprintk+0x3f0/0x4d0 > [sys_futex+180/320] sys_futex+0xb4/0x140 > [acct_collect+435/496] ? acct_collect+0x1b3/0x1f0 > [acct_collect+435/496] ? acct_collect+0x1b3/0x1f0 > [mm_release+142/160] mm_release+0x8e/0xa0 > [exit_mm+29/304] exit_mm+0x1d/0x130 > [do_exit+461/2192] do_exit+0x1cd/0x890 > [rotate_reclaimable_page+211/240] ? rotate_reclaimable_page+0xd3/0xf0 > [do_unblank_screen+29/368] ? do_unblank_screen+0x1d/0x170 > [oops_end+136/144] oops_end+0x88/0x90 > [die+94/144] die+0x5e/0x90 > [do_general_protection+344/368] do_general_protection+0x158/0x170 > [error_exit+0/169] error_exit+0x0/0xa9 > [put_page+14/256] ? put_page+0xe/0x100 > [skb_release_data+171/208] ? skb_release_data+0xab/0xd0 > [skb_release_all+158/240] ? skb_release_all+0x9e/0xf0 > [__kfree_skb+17/160] ? __kfree_skb+0x11/0xa0 > [_end+510662350/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x306/0x940 > [kfree_skb+23/64] ? kfree_skb+0x17/0x40 > [_end+510638598/2109230024] ? :iwl3945:iwl3945_rx_queue_reset+0xae/0x130 > [_end+510662510/2109230024] ? :iwl3945:iwl3945_hw_nic_init+0x3a6/0x940 > [_end+510613961/2109230024] ? :iwl3945:__iwl3945_up+0x91/0x640 > [_end+510616880/2109230024] ? :iwl3945:iwl3945_mac_start+0x568/0x790 > [lock_hrtimer_base+44/96] ? lock_hrtimer_base+0x2c/0x60 > [rb_insert_color+265/320] ? rb_insert_color+0x109/0x140 > [_end+510327174/2109230024] ? :mac80211:ieee80211_open+0x13e/0x590 > [dev_set_rx_mode+72/96] ? dev_set_rx_mode+0x48/0x60 > [dev_open+121/176] ? dev_open+0x79/0xb0 > [dev_change_flags+153/464] ? dev_change_flags+0x99/0x1d0 > [do_setlink+524/928] ? do_setlink+0x20c/0x3a0 > [_read_unlock+48/96] ? _read_unlock+0x30/0x60 > [rtnl_setlink+269/336] ? rtnl_setlink+0x10d/0x150 > [rtnetlink_rcv_msg+397/576] ? 
rtnetlink_rcv_msg+0x18d/0x240 > [rtnetlink_rcv_msg+0/576] ? rtnetlink_rcv_msg+0x0/0x240 > [netlink_rcv_skb+137/176] ? netlink_rcv_skb+0x89/0xb0 > [rtnetlink_rcv+41/64] ? rtnetlink_rcv+0x29/0x40 > [netlink_unicast+709/736] ? netlink_unicast+0x2c5/0x2e0 > [__alloc_skb+110/336] ? __alloc_skb+0x6e/0x150 > [netlink_sendmsg+498/752] ? netlink_sendmsg+0x1f2/0x2f0 > [_read_unlock+78/96] ? _read_unlock+0x4e/0x60 > [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 > [sock_recvmsg+313/336] ? sock_recvmsg+0x139/0x150 > [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > [sock_sendmsg+295/320] ? sock_sendmsg+0x127/0x140 > [move_addr_to_kernel+87/96] ? move_addr_to_kernel+0x57/0x60 > [verify_iovec+60/208] ? verify_iovec+0x3c/0xd0 > [sys_sendmsg+393/800] ? sys_sendmsg+0x189/0x320 > [sys_sendto+253/288] ? sys_sendto+0xfd/0x120 > [trace_hardirqs_on_thunk+53/58] ? trace_hardirqs_on_thunk+0x35/0x3a > [system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80 > > NetworkManager used greatest stack depth: 2928 bytes left > eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: None > ACPI: \_SB_.GDCK - undocking > usb 1-4: USB disconnect, address 4 > ACPI: \_SB_.GDCK - docking > usb 1-4: new high speed USB device using ehci_hcd and address 5 > usb 1-4: configuration #1 chosen from 1 choice > hub 1-4:1.0: USB hub found > hub 1-4:1.0: 4 ports detected > usb 1-4: New USB device found, idVendor=04b3, idProduct=4485 > usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0 > SysRq : Emergency Sync > Emergency Sync complete > SysRq : Emergency Remount R/O > ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 15:53 ` Linus Torvalds @ 2008-04-23 16:58 ` Pekka Enberg 2008-04-23 17:28 ` Zdenek Kabelac 2008-04-23 17:40 ` Ingo Molnar ` (2 subsequent siblings) 3 siblings, 1 reply; 183+ messages in thread From: Pekka Enberg @ 2008-04-23 16:58 UTC (permalink / raw) To: Linus Torvalds Cc: Zdenek Kabelac, Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Christoph Lameter [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 1858 bytes --] On Wed, Apr 23, 2008 at 6:53 PM, Linus Torvalds<torvalds@linux-foundation.org> wrote:> Padding 0xffff8100201a0000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk> ....> Padding 0xffff8100201a71a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkkÒ> Padding 0xffff8100201a71b0: cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff ÐÐÐÐÐÐÐÐ......ÑÑ> Padding 0xffff8100201a71c0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Ðp..ÑÑÑÑ....s...> Padding 0xffff8100201a71d0: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Ð¥q&.ÑÑÑÑ>> Padding 0xffff8100201a71e0: 00 00 00 00 7c 05 00 00 97 54 58 00 01 00 00 00 ....|....TX.....>> which in turn is interesting because it very much looks like SLUB> re-used a page for something else (the values that things got> overwritten by are largely SLUB's own poison bytes: 6b is POISON_FREE,> the a5 at the end of the list of 6b's is POISON_END, while cc is> SLUB_RED_ACTIVE).>> To me, that pattern looks like an order-3 allocation (correct: that's what> kmalloc-4096 is supposed to be using!) 
got released, and the stuff at the
> end (with slub debugging, there's only room for 7 4096-byte allocations
> there, so 71b0 is past the end) in that SLUB debug info.
>
> The first word of that busy allocation is ffff8100201a0000, which is also
> the base pointer to the whole order-3 page ("Free pointer"), followed by
> the SLAB tracking data.

Is the POISON_FREE ("6b") region really contiguous, Zdenek? The problem here is that the object looks to be 29104 bytes, so it would be subject to kmalloc_large(), which bypasses SLUB poisoning completely.

Pekka
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 16:58 ` Pekka Enberg @ 2008-04-23 17:28 ` Zdenek Kabelac 0 siblings, 0 replies; 183+ messages in thread From: Zdenek Kabelac @ 2008-04-23 17:28 UTC (permalink / raw) To: Pekka Enberg Cc: Linus Torvalds, Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Christoph Lameter [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 8829 bytes --] 2008/4/23, Pekka Enberg <penberg@cs.helsinki.fi>:> On Wed, Apr 23, 2008 at 6:53 PM, Linus Torvalds> <torvalds@linux-foundation.org> wrote:> > Padding 0xffff8100201a0000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk> > ....> > Padding 0xffff8100201a71a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkkÒ> > Padding 0xffff8100201a71b0: cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff ÐÐÐÐÐÐÐÐ......ÑÑ> > Padding 0xffff8100201a71c0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Ðp..ÑÑÑÑ....s...> > Padding 0xffff8100201a71d0: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Ð¥q&.ÑÑÑÑ> >> > Padding 0xffff8100201a71e0: 00 00 00 00 7c 05 00 00 97 54 58 00 01 00 00 00 ....|....TX.....> >> > which in turn is interesting because it very much looks like SLUB> > re-used a page for something else (the values that things got> > overwritten by are largely SLUB's own poison bytes: 6b is POISON_FREE,> > the a5 at the end of the list of 6b's is POISON_END, while cc is> > SLUB_RED_ACTIVE).> >> > To me, that pattern looks like an order-3 allocation (correct: that's what> > kmalloc-4096 is supposed to be using!) 
got released, and the stuff at the> > end (with slub debugging, there's only room for 7 4096-byte allocations> > there, so 71b0 is past the end) in that SLUB debug info.> >> > The first word of that busy allocation is ffff8100201a0000, which is also> > the base pointer to the whole order-3 page ("Free pointer"), followed by> > the SLAB tracking data.>>> Is the POISON_FREE ("6b") region really contiguous Zdenek? The problem> here is that the object looks to be 29104 bytes that is subject to> kmalloc_large() which by-passes SLUB poisoning completely.> No - it was simply cutted - here is carefully selected same log withall different areas. Zdenek Padding 0xffff8100201a0000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b6b 6b kkkkkkkkkkkkkkkk201a0010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a0ba0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a0bb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a0bc0: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................201a0bd0: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk201a0be0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a0bf0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a0fe0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a0ff0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkkÂ¥201a1000: cc cc cc cc cc cc cc cc d8 30 1a 20 00 81 ff ff ÃÃÃÃÃÃÃÃÃ0....ÿÿ201a1010: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Ãp..ÿÿÿÿ....s...201a1020: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Ãq&.ÿÿÿÿ201a1030: 00 00 00 00 00 00 00 00 8c 54 58 00 01 00 00 00 .........TX.....201a1040: 5a 5a 5a 5a 5a 5a 5a 5a 6b 6b 6b 6b 6b 6b 6b 6b ZZZZZZZZkkkkkkkk201a1050: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a1be0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a1bf0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
kkkkkkkkkkkkkkkk201a1c00: 6b 6b 6b 6b 6b 6b 6b 6b 01 00 00 00 00 00 00 00 kkkkkkkk........201a1c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................201a1c20: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a1c30: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a2020: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a2030: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a2040: 6b 6b 6b 6b 6b 6b 6b a5 cc cc cc cc cc cc cc cc kkkkkkkÂ¥ÃÃÃÃÃÃÃÃ201a2050: 68 51 1a 20 00 81 ff ff cd 70 17 a0 ff ff ff ff hQ....ÿÿÃp..ÿÿÿÿ201a2060: 00 00 00 00 73 05 00 00 b6 54 58 00 01 00 00 00 ....s...¶TX.....201a2070: d5 71 26 81 ff ff ff ff 00 00 00 00 00 00 00 00 Ãq&.ÿÿÿÿ........201a2080: 97 54 58 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a .TX.....ZZZZZZZZ201a2090: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a2c40: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a2c50: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................201a2c60: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk201a2c70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a3070: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a3080: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkkÂ¥201a3090: cc cc cc cc cc cc cc cc 00 00 00 00 00 00 00 00 ÃÃÃÃÃÃÃÃ........201a30a0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Ãp..ÿÿÿÿ....s...201a30b0: 5b 4c 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff [LX.....Ãq&.ÿÿÿÿ201a30c0: 00 00 00 00 7c 05 00 00 f3 4b 58 00 01 00 00 00 ....|...óKX.....201a30d0: 5a 5a 5a 5a 5a 5a 5a 5a 6b 6b 6b 6b 6b 6b 6b 6b ZZZZZZZZkkkkkkkk201a30e0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a3c80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a3c90: 6b 6b 6b 6b 6b 6b 6b 6b 01 00 00 00 00 00 00 00 kkkkkkkk........201a3ca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 ................201a3cb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a40c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a40d0: 6b 6b 6b 6b 6b 6b 6b a5 cc cc cc cc cc cc cc cc kkkkkkkÂ¥ÃÃÃÃÃÃÃÃ201a40e0: 20 41 1a 20 00 81 ff ff cd 70 17 a0 ff ff ff ff .A....ÿÿÃp..ÿÿÿÿ201a40f0: 00 00 00 00 73 05 00 00 b6 54 58 00 01 00 00 00 ....s...¶TX.....201a4100: d5 71 26 81 ff ff ff ff 00 00 00 00 7c 05 00 00 Ãq&.ÿÿÿÿ....|...201a4110: 87 54 58 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a .TX.....ZZZZZZZZ201a4120: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a4cd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a4ce0: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................201a4cf0: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk201a4d00: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a5100: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a5110: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkkÂ¥201a5120: cc cc cc cc cc cc cc cc 00 00 00 00 00 00 00 00 ÃÃÃÃÃÃÃÃ........201a5130: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Ãp..ÿÿÿÿ....s...201a5140: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Ãq&.ÿÿÿÿ201a5150: 00 00 00 00 00 00 00 00 86 54 58 00 01 00 00 00 .........TX.....201a5160: 5a 5a 5a 5a 5a 5a 5a 5a 6b 6b 6b 6b 6b 6b 6b 6b ZZZZZZZZkkkkkkkk201a5170: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a5d10: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a5d20: 6b 6b 6b 6b 6b 6b 6b 6b 01 00 00 00 00 00 00 00 kkkkkkkk........201a5d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................201a5d40: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a6150: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a6160: 6b 6b 6b 6b 6b 6b 6b a5 cc cc cc cc cc cc cc cc kkkkkkkÂ¥ÃÃÃÃÃÃÃÃ201a6170: b0 61 1a 20 00 81 ff ff cd 70 17 a0 
ff ff ff ff °a....ÿÿÃp..ÿÿÿÿ201a6180: 00 00 00 00 73 05 00 00 b6 54 58 00 01 00 00 00 ....s...¶TX.....201a6190: d5 71 26 81 ff ff ff ff 00 00 00 00 00 00 00 00 Ãq&.ÿÿÿÿ........201a61a0: 97 54 58 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a .TX.....ZZZZZZZZ201a61b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a6d60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a6d70: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................201a6d80: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk201a6d90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 201a7190: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk201a71a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkkÂ¥201a71b0: cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff ÃÃÃÃÃÃÃÃ......ÿÿ201a71c0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Ãp..ÿÿÿÿ....s...201a71d0: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Ãq&.ÿÿÿÿ201a71e0: 00 00 00 00 7c 05 00 00 97 54 58 00 01 00 00 00 ....|....TX.....Padding 0xffff8100201a71f0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZFIX kmalloc-4096: Restoring 0xffff8100201a0000-0xffff8100201a7e16=0x5a ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 15:53 ` Linus Torvalds 2008-04-23 16:58 ` Pekka Enberg @ 2008-04-23 17:40 ` Ingo Molnar 2008-04-23 18:52 ` Pekka Enberg 2008-04-24 22:26 ` Jiri Slaby 3 siblings, 0 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-23 17:40 UTC (permalink / raw) To: Linus Torvalds Cc: Zdenek Kabelac, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > CPU: L2 cache: 4096K > > CPU: Physical Processor ID: 0 > > CPU: Processor Core ID: 1 > > x86: PAT support disabled. > > SPIN IRQ ALREADY DISABLED > > Pid: 0, comm: swapper Not tainted 2.6.25 #57 > > > > Call Trace: > > [_spin_lock_irq+126/128] _spin_lock_irq+0x7e/0x80 > > [lock_ipi_call_lock+16/32] lock_ipi_call_lock+0x10/0x20 > > CPU1: Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz > > [start_secondary+68/206] start_secondary+0x44/0xce > > This is indeed an interesting issue: arch/x86/kernel/smpboot.c does an > IPI call to start_secondary, and yes, it looks suspicious to have that > lock_ipi_call_lock there (and in particular the unlock_ipi_call_lock > that enables interrupts within it). Ingo? hm, irqs already disabled isnt bad in itself and it happens all the time. The irq enabling in unlock_ipi_call_lock() should be OK. Any race with irqs there should at most result in a hung or crashed bootup, not in any memory corruption i believe. Ingo ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 15:53 ` Linus Torvalds 2008-04-23 16:58 ` Pekka Enberg 2008-04-23 17:40 ` Ingo Molnar @ 2008-04-23 18:52 ` Pekka Enberg 2008-04-23 19:05 ` Christoph Lameter 2008-04-24 22:26 ` Jiri Slaby 3 siblings, 1 reply; 183+ messages in thread
From: Pekka Enberg @ 2008-04-23 18:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zdenek Kabelac, Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Christoph Lameter

Linus Torvalds wrote:
> Looks like possibly a double free to me (with the first free caused the
> page to be re-used, the second free is the one that triggers the debug
> message). But maybe Pekka or Christoph are better at reading those oopses.
>
>> =============================================================================
>> BUG kmalloc-4096: Padding overwritten. 0x0000000000000000-0x00000000ffffffff
>> -----------------------------------------------------------------------------

Okay, this doesn't make sense to me. The code does:

	u8 *start;
	u8 *fault;

	/* ... */

	start = page_address(page);

	/* ... */

	fault = check_bytes(start + length, POISON_INUSE, remainder);
	if (!fault)
		return 1;
	while (end > fault && end[-1] == POISON_INUSE)
		end--;

	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);

So how come we're printing out 'fault' as zero and 'end' at 4 GB? Christoph?

Zdenek, can you please send the full dmesg?

			Pekka
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 18:52 ` Pekka Enberg @ 2008-04-23 19:05 ` Christoph Lameter 2008-04-23 19:19 ` Pekka J Enberg 0 siblings, 1 reply; 183+ messages in thread From: Christoph Lameter @ 2008-04-23 19:05 UTC (permalink / raw) To: Pekka Enberg Cc: Linus Torvalds, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert On Wed, 23 Apr 2008, Pekka Enberg wrote: > fault = check_bytes(start + length, POISON_INUSE, remainder); fault == NULL if the check was successful. Otherwise it contains the first address that does not match our expectations. > if (!fault) > return 1; > while (end > fault && end[-1] == POISON_INUSE) > end--; > > slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1); > > So how come we're printing out 'fault' as zero and 'end' at 4 GB? Christoph? We should have returned from the function and not printed this message. If we somehow skipped the test for !fault then end could have wrapped around which gets us to 4GB. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 19:05 ` Christoph Lameter @ 2008-04-23 19:19 ` Pekka J Enberg 2008-04-23 19:28 ` Christoph Lameter 2008-04-23 20:27 ` Zdenek Kabelac 0 siblings, 2 replies; 183+ messages in thread
From: Pekka J Enberg @ 2008-04-23 19:19 UTC (permalink / raw)
To: Christoph Lameter
Cc: Linus Torvalds, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert

On Wed, 23 Apr 2008, Christoph Lameter wrote:
> We should have returned from the function and not printed this message. If
> we somehow skipped the test for !fault then end could have wrapped around
> which gets us to 4GB.

Aah, looks like it's just a silly bug in slab_fix(). If this looks ok to
Christoph, can you re-test with this patch applied Zdenek? That way we'll
actually know where SLUB expected to see POISON_INUSE.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
diff --git a/mm/slub.c b/mm/slub.c
index 7f8aaa2..dac50e3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -456,6 +456,15 @@ static void print_page_info(struct page *page)
 
 }
 
+static void __slab_bug(struct kmem_cache *s, char *buf)
+{
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
 static void slab_bug(struct kmem_cache *s, char *fmt, ...)
 {
 	va_list args;
@@ -464,11 +473,7 @@ static void slab_bug(struct kmem_cache *s, char *fmt, ...)
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
-	printk(KERN_ERR "========================================"
-			"=====================================\n");
-	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
-	printk(KERN_ERR "----------------------------------------"
-			"-------------------------------------\n\n");
+	__slab_bug(s, buf);
 }
 
 static void slab_fix(struct kmem_cache *s, char *fmt, ...)
@@ -533,7 +538,7 @@ static void slab_err(struct kmem_cache *s, struct page *page, char *fmt, ...)
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
-	slab_bug(s, fmt);
+	__slab_bug(s, buf);
 	print_page_info(page);
 	dump_stack();
 }
^ permalink raw reply related [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 19:19 ` Pekka J Enberg @ 2008-04-23 19:28 ` Christoph Lameter 2008-04-23 20:27 ` Zdenek Kabelac 1 sibling, 0 replies; 183+ messages in thread
From: Christoph Lameter @ 2008-04-23 19:28 UTC (permalink / raw)
To: Pekka J Enberg
Cc: Linus Torvalds, Zdenek Kabelac, Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert

Or simpler (catching yet another case):

Subject: slab_err: Pass parameters correctly to slab_bug

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-04-23 12:24:02.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-04-23 12:27:03.000000000 -0700
@@ -521,7 +521,7 @@ static void print_trailer(struct kmem_ca
 static void object_err(struct kmem_cache *s, struct page *page,
 			u8 *object, char *reason)
 {
-	slab_bug(s, reason);
+	slab_bug(s, "%s", reason);
 	print_trailer(s, page, object);
 }
 
@@ -533,7 +533,7 @@ static void slab_err(struct kmem_cache *
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
-	slab_bug(s, fmt);
+	slab_bug(s, "%s", buf);
 	print_page_info(page);
 	dump_stack();
 }
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 19:19 ` Pekka J Enberg 2008-04-23 19:28 ` Christoph Lameter @ 2008-04-23 20:27 ` Zdenek Kabelac 1 sibling, 0 replies; 183+ messages in thread
From: Zdenek Kabelac @ 2008-04-23 20:27 UTC (permalink / raw)
To: Pekka J Enberg
Cc: Christoph Lameter, Linus Torvalds, Ingo Molnar, Jiri Slaby, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert

[-- Attachment #1: Type: text/plain, Size: 810 bytes --]

2008/4/23, Pekka J Enberg <penberg@cs.helsinki.fi>:
> On Wed, 23 Apr 2008, Christoph Lameter wrote:
> > We should have returned from the function and not printed this message. If
> > we somehow skipped the test for !fault then end could have wrapped around
> > which gets us to 4GB.
>
>
> Aah, looks like it's just a silly bug in slab_fix(). If this looks ok to
> Christoph, can you re-test with this patch applied Zdenek? That way we'll
> actually know where SLUB expected to see POISON_INUSE.

Unfortunately it won't be easy to retest - I just know it happened to me with some wi-fi networking interaction after resume. I'll rebuild the kernel with these slab patches - but I have no idea how to trigger the bug.

Attached is the bzip2-compressed dmesg in case it is still needed for something.

Zdenek

[-- Attachment #2: messages.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 17363 bytes --]
^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-23 15:53 ` Linus Torvalds ` (2 preceding siblings ...) 2008-04-23 18:52 ` Pekka Enberg @ 2008-04-24 22:26 ` Jiri Slaby 2008-04-24 22:41 ` Linus Torvalds 2008-04-25 1:35 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff David Miller 3 siblings, 2 replies; 183+ messages in thread
From: Jiri Slaby @ 2008-04-24 22:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zdenek Kabelac, Ingo Molnar, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter

On 04/23/2008 05:53 PM, Linus Torvalds wrote:
>> (SPIN LOCK already disabled is my personal trace ooops which is just
>> checking if the spin_lock_irq is already called with disabled irq - in
>> this place probably irqsave version should be used instead, otherwice
>> it's not properly restored)
>
> Yes, that's interesting to see.

And this too :/:

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAGIC 0xff00aa00deadcc22ULL

int main()
{
	unsigned int a, b, c = 0;
	unsigned long long *ch;

	while (1) {
		ch = malloc(1000000000);
		if (!ch)
			err(1, "malloc");
		for (a = 0; a < 1000000000/sizeof(*ch); a++)
			ch[a] = MAGIC;
		printf("alloced %u\n", c);
		sleep(10);
		for (a = 0; a < 1000000000/sizeof(*ch); a++)
			if (ch[a] != MAGIC) {
				printf("WHAT THE HELL (%.8lx):\n", a * sizeof(*ch));
				for (b = a - a % 10; b < (a - a % 10) + 100; b++) {
					printf("%.16llx ", ch[b]);
					if (!((b + 1) % 10))
						puts("");
				}
				exit(1);
			}
		free(ch);
		printf("freed %u\n", c);
		sleep(10);
		c++;
	}
	return 0;
}

20 arpings running on wlan0 (don't know if this is related so far), suspend, resume, right after resume:

freed 114
alloced 115
freed 115
alloced 116
WHAT THE HELL (000a3ff8):
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadccf0 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 Again those fuc*ing 0xf0s... Shouldn't be 2.6.25 uttered as broken until this is solved to not corrupt anyone's data? I'm going to play with that testing program further in the meantime. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-24 22:26 ` Jiri Slaby @ 2008-04-24 22:41 ` Linus Torvalds 2008-04-25 0:57 ` Jiri Slaby 2008-04-25 17:10 ` Jiri Slaby 2008-04-25 1:35 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff David Miller 1 sibling, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-24 22:41 UTC (permalink / raw) To: Jiri Slaby Cc: Zdenek Kabelac, Ingo Molnar, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter On Fri, 25 Apr 2008, Jiri Slaby wrote: > > 20 arpings running on wlan0 (don't know if this is related so far), suspend, > resume, right after resume: > > freed 114 > alloced 115 > freed 115 > alloced 116 > WHAT THE HELL (000a3ff8): > ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 Very interesting indeed. > Shouldn't be 2.6.25 uttered as broken until this is solved to not corrupt > anyone's data? I'm going to play with that testing program further in the > meantime. Do you actually see this with _plain_ 2.6.25? So far I've assumed that all the reports are about post-2.6.25 issues. Also, it does seem like you can re-create this at will in ways that others can not. Could you try to bisect it a bit? Right now we have no real clue what it is all about, except that it seems to be related to suspend/resume and there are some indications that it's about networking (and _perhaps_ wireless in particular). Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-24 22:41 ` Linus Torvalds @ 2008-04-25 0:57 ` Jiri Slaby 2008-04-24 23:45 ` Linus Torvalds 2008-04-25 17:10 ` Jiri Slaby 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 0:57 UTC (permalink / raw) To: Linus Torvalds Cc: Zdenek Kabelac, Ingo Molnar, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter On 04/25/2008 12:41 AM, Linus Torvalds wrote: > > On Fri, 25 Apr 2008, Jiri Slaby wrote: >> 20 arpings running on wlan0 (don't know if this is related so far), suspend, >> resume, right after resume: >> >> freed 114 >> alloced 115 >> freed 115 >> alloced 116 >> WHAT THE HELL (000a3ff8): >> ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 > > Very interesting indeed. > >> Shouldn't be 2.6.25 uttered as broken until this is solved to not corrupt >> anyone's data? I'm going to play with that testing program further in the >> meantime. > > Do you actually see this with _plain_ 2.6.25? So far I've assumed that all > the reports are about post-2.6.25 issues. Blah, sorry, I lived in a theory, that Rafael got one from 2.6.25 and no, really, he had -git2 applied. Mea culpa, anyway I'll test 2.6.25 too, just for sure. > Also, it does seem like you can re-create this at will in ways that others > can not. Could you try to bisect it a bit? Right now we have no real clue > what it is all about, except that it seems to be related to suspend/resume > and there are some indications that it's about networking (and _perhaps_ > wireless in particular). Not really. I have no idea what triggers it. Seems like suspend is some kind of catalyzer not working every time. I can't get it to crash for 1 whole day. Probably with the program (I hacked it a hour ago or so) it would be easier to reveal the error far before the crash itself. Will keep you informed. 
Now going to catch some sleep. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 0:57 ` Jiri Slaby @ 2008-04-24 23:45 ` Linus Torvalds 2008-04-25 7:36 ` Jiri Slaby 0 siblings, 1 reply; 183+ messages in thread From: Linus Torvalds @ 2008-04-24 23:45 UTC (permalink / raw) To: Jiri Slaby Cc: Zdenek Kabelac, Ingo Molnar, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter On Fri, 25 Apr 2008, Jiri Slaby wrote: > > Blah, sorry, I lived in a theory, that Rafael got one from 2.6.25 and no, > really, he had -git2 applied. Mea culpa, anyway I'll test 2.6.25 too, just for > sure. No problem. I do think this is a post-2.6.25 thing, because I don't think we've ever seen it before that. There is a 2.6.25-rc8-mm2 report, but that was a -mm tree that had a lot of the stuff that was merged after 2.6.25, so I'm pretty sure that counts as "post" too. If it wasn't, we'd be seeing a lot more of this. > Not really. I have no idea what triggers it. Seems like suspend is some kind > of catalyzer not working every time. I don't think suspend/resume is sufficient, because I've tried to reproduce it here (and I tried your test program too) on my macmini, and it's not happening. So there almost certainly something else too required to trigger it. Btw, how do you suspend/resume? That matters, because I've been testing just the normal echo mem > /sys/power/state and with a kernel where everything is compiled-in. But if you use the GUI suspend, on a common distro, I think that one ends up doing a whole lot more, including doing things like unloading and reloading modules, and for all we know the problem is not about suspend itself, but about the things going on around it. And it might be a module unload issue, rather than the suspend itself (the same way I theorized that it might be a ifconfig down/up rather than the suspend code itself). Who knows.. It might also very well be hardware-specific. 
You guys seem to have different wireless setups (ath5k vs b43) but it might be generic 80211 code, but it might also be something *totally* unrelated, and the only reason wireless has shown up might be that networking is just in use when the problem happens. Jiri, Zdenek, Rafael, could you try to compare hardware with each other and see if there is some pattern there? (And btw, the program you used that allocates a hundred meg and tries to find it - I'm assuming you're not paging or anything like that, ie you're not even close to out-of-memory. If that isn't correct, holler. I'm trying to reproduce this thing). Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-24 23:45 ` Linus Torvalds @ 2008-04-25 7:36 ` Jiri Slaby 2008-04-25 14:09 ` Pavel Machek 2008-04-25 15:30 ` Rafael J. Wysocki 0 siblings, 2 replies; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 7:36 UTC (permalink / raw) To: Linus Torvalds Cc: Zdenek Kabelac, Ingo Molnar, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter On 04/25/2008 01:45 AM, Linus Torvalds wrote: > On Fri, 25 Apr 2008, Jiri Slaby wrote: >> Not really. I have no idea what triggers it. Seems like suspend is some kind >> of catalyzer not working every time. > > I don't think suspend/resume is sufficient, because I've tried to > reproduce it here (and I tried your test program too) on my macmini, and > it's not happening. So there almost certainly something else too required > to trigger it. > > Btw, how do you suspend/resume? That matters, because I've been testing > just the normal > > echo mem > /sys/power/state > > and with a kernel where everything is compiled-in. But if you use the GUI > suspend, on a common distro, I think that one ends up doing a whole lot > more, including doing things like unloading and reloading modules, and for > all we know the problem is not about suspend itself, but about the things > going on around it. pm-suspend without suspend package -- i.e. it writes mem > state, but does some processing before and after that. However no module loads or removes. Particualry I have hibernate|suspend) service autofs stop >/dev/null service vmware stop >/dev/null ;; thaw|resume) service autofs start >/dev/null ;; While vmware is not running, autofs is. The rest of scripts is from http://download.opensuse.org/distribution/SL-OSS-factory/inst-source/suse/x86_64/pm-utils-0.99.3.20070618-49.x86_64.rpm [I see now that suse added autofs stopping to their scripts too.] Not using networkmanager. 
Nothing in any pm confs, no VIDEO s3 quirks, no unload modules. No bluetooth, no pcmcia, no batteries, no cpufreq, no backlight. -- It's desktop. /proc/acpi/fan/*/state doesn't exist The probably only done handling is hwclock. lrwxrwxrwx 1 root root 0 Apr 25 02:44 /sys/class/rtc/rtc0/device/driver -> ../../../bus/pnp/drivers/rtc_cmos > Jiri, Zdenek, Rafael, could you try to compare hardware with each other > and see if there is some pattern there? 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller (rev 02) 00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express Integrated Graphics Controller (rev 02) 00:03.0 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express MEI Controller (rev 02) 00:03.1 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express MEI Controller (rev 02) 00:03.2 IDE interface: Intel Corporation 82G33/G31/P35/P31 Express PT IDER Controller (rev 02) 00:03.3 Serial controller: Intel Corporation 82G33/G31/P35/P31 Express Serial KT Controller (rev 02) 00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02) 00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02) 00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02) 00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02) 00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02) 00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02) 00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 02) 00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 02) 00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02) 00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02) 00:1d.2 USB 
Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02) 00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92) 00:1f.0 ISA bridge: Intel Corporation Device 2910 (rev 02) 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02) 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02) 00:1f.6 Signal processing controller: Intel Corporation 82801I (ICH9 Family) Thermal Subsystem (rev 02) 02:00.0 PCI bridge: Texas Instruments XIO2000(A)/XIO2200(A) PCI Express-to-PCI Bridge (rev 03) 03:00.0 FireWire (IEEE 1394): Texas Instruments XIO2200(A) IEEE-1394a-2000 Controller (PHY/Link) (rev 01) 04:00.0 Ethernet controller: Atheros Communications Inc. AR5212/AR5213 Multiprotocol MAC/baseband processor (rev 01) Bus 008 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 004 Device 007: ID 045e:00f0 Microsoft Corp. Bus 004 Device 006: ID 0458:004c KYE Systems Corp. (Mouse Systems) Slimstar Pro Keyboard Bus 004 Device 005: ID 04b4:2050 Cypress Semiconductor Corp. Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub core 2 duo, 2gigs of mem, 2 sata II disks, raid0, raid1 (both 0.9), lvm2, ext3 above all of it. Modules: Module Size Used by tun 11012 1 <----- Using vpn! 
bitrev 2240 1 tun ipv6 269736 36 arc4 2432 2 ecb 3584 2 crypto_blkcipher 18052 1 ecb cryptomgr 3712 0 crypto_algapi 15872 4 arc4,ecb,crypto_blkcipher,cryptomgr ath5k 104640 0 mac80211 140240 1 ath5k crc32 4416 2 tun,mac80211 sr_mod 15748 0 rtc_cmos 10232 0 rtc_core 17220 1 rtc_cmos floppy 64488 0 cfg80211 27920 2 ath5k,mac80211 cdrom 37800 1 sr_mod ohci1394 31412 0 rtc_lib 3328 1 rtc_core ieee1394 90808 1 ohci1394 evdev 11584 5 usbhid 49952 0 hid 73664 1 usbhid ff_memless 6088 1 usbhid ehci_hcd 37388 0 > (And btw, the program you used that allocates a hundred meg and tries to > find it - I'm assuming you're not paging or anythign like that, ie you're > not even close to out-of-memory. If that isn't correct, holler. I'm trying > to reproduce this thing). OOM is too far away: Swap: 2008084 32 2008052 Jiri ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:36 ` Jiri Slaby @ 2008-04-25 14:09 ` Pavel Machek 2008-04-25 15:30 ` Rafael J. Wysocki 1 sibling, 0 replies; 183+ messages in thread From: Pavel Machek @ 2008-04-25 14:09 UTC (permalink / raw) To: Jiri Slaby Cc: Linus Torvalds, Zdenek Kabelac, Ingo Molnar, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter Hi! > core 2 duo, 2gigs of mem, 2 sata II disks, raid0, raid1 (both 0.9), lvm2, > ext3 above all of it. > > Modules: > Module Size Used by > tun 11012 1 <----- Using vpn! > bitrev 2240 1 tun > ipv6 269736 36 > arc4 2432 2 > ecb 3584 2 > crypto_blkcipher 18052 1 ecb > cryptomgr 3712 0 > crypto_algapi 15872 4 arc4,ecb,crypto_blkcipher,cryptomgr > ath5k 104640 0 > mac80211 140240 1 ath5k > crc32 4416 2 tun,mac80211 > sr_mod 15748 0 > rtc_cmos 10232 0 > rtc_core 17220 1 rtc_cmos > floppy 64488 0 > cfg80211 27920 2 ath5k,mac80211 > cdrom 37800 1 sr_mod > ohci1394 31412 0 > rtc_lib 3328 1 rtc_core > ieee1394 90808 1 ohci1394 > evdev 11584 5 > usbhid 49952 0 > hid 73664 1 usbhid > ff_memless 6088 1 usbhid > ehci_hcd 37388 0 One useful trick is trying to boot with init=/bin/bash, and see if the problem persists. If it is gone, it is likely one of the modules, and those may be binary-searched -- which is often easier than bisect. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:36 ` Jiri Slaby 2008-04-25 14:09 ` Pavel Machek @ 2008-04-25 15:30 ` Rafael J. Wysocki 1 sibling, 0 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-25 15:30 UTC (permalink / raw) To: Jiri Slaby Cc: Linus Torvalds, Zdenek Kabelac, Ingo Molnar, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter On Friday, 25 of April 2008, Jiri Slaby wrote: > On 04/25/2008 01:45 AM, Linus Torvalds wrote: > > On Fri, 25 Apr 2008, Jiri Slaby wrote: > >> Not really. I have no idea what triggers it. Seems like suspend is some kind > >> of catalyzer not working every time. > > > > I don't think suspend/resume is sufficient, because I've tried to > > reproduce it here (and I tried your test program too) on my macmini, and > > it's not happening. So there almost certainly something else too required > > to trigger it. > > > > Btw, how do you suspend/resume? That matters, because I've been testing > > just the normal > > > > echo mem > /sys/power/state > > > > and with a kernel where everything is compiled-in. But if you use the GUI > > suspend, on a common distro, I think that one ends up doing a whole lot > > more, including doing things like unloading and reloading modules, and for > > all we know the problem is not about suspend itself, but about the things > > going on around it. > > pm-suspend without suspend package -- i.e. it writes mem > state, but does some > processing before and after that. However no module loads or removes. > > Particualry I have > hibernate|suspend) > service autofs stop >/dev/null > service vmware stop >/dev/null > ;; > thaw|resume) > service autofs start >/dev/null > ;; > > While vmware is not running, autofs is. 
> > The rest of scripts is from > http://download.opensuse.org/distribution/SL-OSS-factory/inst-source/suse/x86_64/pm-utils-0.99.3.20070618-49.x86_64.rpm > > [I see now that suse added autofs stopping to their scripts too.] > Not using networkmanager. > Nothing in any pm confs, no VIDEO s3 quirks, no unload modules. > No bluetooth, no pcmcia, no batteries, no cpufreq, no backlight. -- It's desktop. > /proc/acpi/fan/*/state doesn't exist > > The probably only done handling is hwclock. > lrwxrwxrwx 1 root root 0 Apr 25 02:44 /sys/class/rtc/rtc0/device/driver -> > ../../../bus/pnp/drivers/rtc_cmos > > > Jiri, Zdenek, Rafael, could you try to compare hardware with each other > > and see if there is some pattern there? > > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller > (rev 02) > 00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express > Integrated Graphics Controller (rev 02) > 00:03.0 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express > MEI Controller (rev 02) > 00:03.1 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express > MEI Controller (rev 02) > 00:03.2 IDE interface: Intel Corporation 82G33/G31/P35/P31 Express PT IDER > Controller (rev 02) > 00:03.3 Serial controller: Intel Corporation 82G33/G31/P35/P31 Express Serial KT > Controller (rev 02) > 00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network > Connection (rev 02) > 00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI > Controller #4 (rev 02) > 00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI > Controller #5 (rev 02) > 00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI > Controller #6 (rev 02) > 00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI > Controller #2 (rev 02) > 00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller > (rev 02) > 00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) 
PCI Express Port 1 > (rev 02) > 00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 > (rev 02) > 00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI > Controller #1 (rev 02) > 00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI > Controller #2 (rev 02) > 00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI > Controller #3 (rev 02) > 00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI > Controller #1 (rev 02) > 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92) > 00:1f.0 ISA bridge: Intel Corporation Device 2910 (rev 02) > 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port > SATA AHCI Controller (rev 02) > 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02) > 00:1f.6 Signal processing controller: Intel Corporation 82801I (ICH9 Family) > Thermal Subsystem (rev 02) > 02:00.0 PCI bridge: Texas Instruments XIO2000(A)/XIO2200(A) PCI Express-to-PCI > Bridge (rev 03) > 03:00.0 FireWire (IEEE 1394): Texas Instruments XIO2200(A) IEEE-1394a-2000 > Controller (PHY/Link) (rev 01) > 04:00.0 Ethernet controller: Atheros Communications Inc. AR5212/AR5213 > Multiprotocol MAC/baseband processor (rev 01) Well, my machine is based on Athlon 64 X2 with an ATI chipset. The only two common things it has with your machine is probably that we both use 64-bit kernels and wireless adapters (different ones, for that matter). I do use NetworkManager, BTW. Well, one thing that suspend does and which is not done routinely is CPU hotplugging. Could you please check if you are able to provoke the symptoms to appear by offlining-onlining CPU1? Thanks, Rafael ^ permalink raw reply [flat|nested] 183+ messages in thread
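Rafael's suggestion to offline and online CPU1 can be tried without a full suspend cycle, through the CPU hotplug files in sysfs. A minimal sketch in C, assuming the standard /sys/devices/system/cpu/cpuN/online interface, a kernel built with CONFIG_HOTPLUG_CPU, and root privileges. The helper names cpu_online_path() and set_cpu_online() are made up for illustration and do not come from the thread:

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the sysfs path of a CPU's "online" file.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int cpu_online_path(int cpu, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write "0" (offline) or "1" (online) to that file.
 * Needs root; returns 0 on success, -1 on any failure.
 */
static int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;
	int ret = 0;

	if (cpu_online_path(cpu, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(online ? "1" : "0", f) == EOF)
		ret = -1;
	/* The kernel acts on the write when the stream is flushed. */
	if (fclose(f) != 0)
		ret = -1;

	return ret;
}
```

Calling set_cpu_online(1, 0) followed by set_cpu_online(1, 1) in a loop, while the memory checker is running, would exercise the hotplug path on its own.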
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-24 22:41 ` Linus Torvalds 2008-04-25 0:57 ` Jiri Slaby @ 2008-04-25 17:10 ` Jiri Slaby 2008-04-25 9:13 ` David Miller 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 17:10 UTC (permalink / raw) To: Linus Torvalds Cc: Zdenek Kabelac, Ingo Molnar, Rafael J. Wysocki, paulmck, David Miller, Linux Kernel Mailing List, Andrew Morton, linux-ext4, herbert, Pekka Enberg, Christoph Lameter On 04/25/2008 12:41 AM, Linus Torvalds wrote: > Also, it does seem like you can re-create this at will in ways that others > can not. Could you try to bisect it a bit? Right now we have no real clue > what it is all about, except that it seems to be related to suspend/resume > and there are some indications that it's about networking (and _perhaps_ > wireless in particular). Yes! I'm able to reproduce it 90% of the time! We can exclude X, 80211 and therefore ath5k too (I removed them from /lib/modules, so they can't ever load). I set the watermark for the test program to allocate 1700 megabytes, so that it eats almost all free memory. Then I wait for "alloced 1" and run pm-suspend on a second console. After resume, it spits out the corruption. I'm going to bisect it, will be back in a few hours ;). ^ permalink raw reply [flat|nested] 183+ messages in thread
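The test program itself is not reproduced in this part of the thread. A minimal sketch of the idea in C, assuming it fills a large allocation with the 0xff00aa00deadcc22 pattern that shows up in the later dumps and rescans the buffer after resume. The function names and the structure are guesses for illustration, not Jiri's code:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill pattern as it appears in the dumps later in the thread. */
#define FILL_PATTERN UINT64_C(0xff00aa00deadcc22)

/* Write the pattern into every 64-bit word of the buffer. */
static void fill_pattern(uint64_t *buf, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		buf[i] = FILL_PATTERN;
}

/*
 * Scan the buffer and return the index of the first word that no
 * longer matches the pattern, or nwords if the buffer is intact.
 */
static size_t find_corruption(const uint64_t *buf, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		if (buf[i] != FILL_PATTERN)
			return i;
	return nwords;
}
```

A real checker would allocate close to all free memory, as described above, fill it once, and then call find_corruption() repeatedly across a suspend/resume cycle, dumping the neighbouring words when a mismatch is found.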
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 17:10 ` Jiri Slaby @ 2008-04-25 9:13 ` David Miller 2008-04-25 12:15 ` Zdenek Kabelac 2008-04-28 0:51 ` [PATCH 1/1] x86: fix text_poke Jiri Slaby 0 siblings, 2 replies; 183+ messages in thread From: David Miller @ 2008-04-25 9:13 UTC (permalink / raw) To: jirislaby Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter From: Jiri Slaby <jirislaby@gmail.com> Date: Fri, 25 Apr 2008 19:10:37 +0200 > I'm going to bisect it, will be back in few hours ;). Thanks for all of this hard work and investigation Jiri! ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 9:13 ` David Miller @ 2008-04-25 12:15 ` Zdenek Kabelac 2008-04-25 12:27 ` Zdenek Kabelac 2008-04-28 0:51 ` [PATCH 1/1] x86: fix text_poke Jiri Slaby 1 sibling, 1 reply; 183+ messages in thread From: Zdenek Kabelac @ 2008-04-25 12:15 UTC (permalink / raw) To: David Miller Cc: jirislaby, torvalds, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter 2008/4/25, David Miller <davem@davemloft.net>: > From: Jiri Slaby <jirislaby@gmail.com> > > Date: Fri, 25 Apr 2008 19:10:37 +0200 > > > > I'm going to bisect it, will be back in few hours ;). > > > Thanks for all of this hard work and investigation Jiri! > Well just to show it's not happing only to Jiri: It's actually shows immediately on my box after suspend-resume... ./testf (Jiri's test code from this thread) alloced 0 WHAT THE HELL (015130f8): ff00aa00deadcc22 ff00aa00f0adcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 12:15 ` Zdenek Kabelac @ 2008-04-25 12:27 ` Zdenek Kabelac 0 siblings, 0 replies; 183+ messages in thread From: Zdenek Kabelac @ 2008-04-25 12:27 UTC (permalink / raw) To: David Miller Cc: jirislaby, torvalds, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter 2008/4/25, Zdenek Kabelac <zdenek.kabelac@gmail.com>: > 2008/4/25, David Miller <davem@davemloft.net>: > > > From: Jiri Slaby <jirislaby@gmail.com> > > > > Date: Fri, 25 Apr 2008 19:10:37 +0200 > > > > > > > I'm going to bisect it, will be back in few hours ;). > > > > > > Thanks for all of this hard work and investigation Jiri! > > > > > Well just to show it's not happing only to Jiri: > And now tested without the iwl wifi driver loaded: ./testf alloced 0 freed 0 alloced 1 freed 1 alloced 2 WHAT THE HELL (38cdabe8): ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 fff0aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00f0adcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 [kabi@localhost ~]$ lsmod Module Size Used by nls_iso8859_2 6592 1 nls_cp852 6848 1 vfat 15808 1 fat 60936 1 vfat i915 37768 2 drm 109024 3 i915 ipt_MASQUERADE 4800 1 iptable_nat 8144 1 nf_nat 23440 2 ipt_MASQUERADE,iptable_nat nf_conntrack_ipv4 20248 4 iptable_nat,nf_nat xt_state 3456 1 nf_conntrack 73072 5 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4,xt_state ipt_REJECT 4992 2 xt_tcpudp 4288 4 iptable_filter 4736 1 ip_tables 22800 2 iptable_nat,iptable_filter x_tables 27872 6 ipt_MASQUERADE,iptable_nat,xt_state,ipt_REJECT,xt_tcpudp,ip_tables bridge 64504 0 llc 9584 1 bridge nfsd 283752 17 lockd 78960 1 nfsd nfs_acl 4672 1 nfsd auth_rpcgss 56448 1 nfsd exportfs 6656 1 nfsd autofs4 28320 2 sunrpc 234848 15 nfsd,lockd,nfs_acl,auth_rpcgss binfmt_misc 14604 1 dm_mirror 23864 0 dm_log 14080 1 dm_mirror dm_mod 74424 5 dm_mirror,dm_log uinput 11728 0 kvm_intel 30272 0 kvm 131376 1 kvm_intel snd_hda_intel 464644 3 snd_seq_oss 38544 0 snd_seq_midi_event 9800 1 snd_seq_oss snd_seq 64968 4 snd_seq_oss,snd_seq_midi_event snd_seq_device 10332 2 snd_seq_oss,snd_seq snd_pcm_oss 49504 0 snd_mixer_oss 19976 1 snd_pcm_oss snd_pcm 94512 2 snd_hda_intel,snd_pcm_oss snd_timer 29280 2 snd_seq,snd_pcm snd 78312 14 snd_hda_intel,snd_seq_oss,snd_seq,snd_seq_device,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer rtc_cmos 12856 0 rtc_core 24100 1 rtc_cmos evdev 15776 8 video 25692 0 
psmouse 47084 0 thinkpad_acpi 66884 0 iTCO_wdt 15168 0 nvram 11272 2 thinkpad_acpi rtc_lib 4160 1 rtc_core serio_raw 8772 0 mmc_block 16080 2 soundcore 10848 1 snd backlight 7064 2 video,thinkpad_acpi i2c_i801 12124 0 i2c_core 29968 1 i2c_i801 intel_agp 32752 1 button 10528 0 sdhci 21260 0 mmc_core 57632 2 mmc_block,sdhci e1000e 111316 0 iTCO_vendor_support 5124 1 iTCO_wdt output 5184 1 video snd_page_alloc 12304 2 snd_hda_intel,snd_pcm battery 16656 0 ac 7816 0 uhci_hcd 29480 0 ohci_hcd 28180 0 ehci_hcd 43156 0 usbcore 177176 4 uhci_hcd,ohci_hcd,ehci_hcd ^ permalink raw reply [flat|nested] 183+ messages in thread
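The damaged words visible in the dumps above differ from the fill pattern in exactly one byte each, and the new value of that byte is 0xf0 every time. A small sketch in C that makes this check explicit. The helper name single_byte_diff() is hypothetical, and byte 0 is taken to be the least significant byte of the printed word:

```c
#include <stdint.h>

/*
 * Compare an observed word against the expected one.
 * Returns the index of the differing byte (0 = least significant) if
 * exactly one byte differs, and stores its observed value in *new_byte.
 * Returns -1 if the words are equal or differ in more than one byte.
 */
static int single_byte_diff(uint64_t expected, uint64_t observed,
			    uint8_t *new_byte)
{
	uint64_t diff = expected ^ observed;
	int found = -1;

	for (int i = 0; i < 8; i++) {
		if ((diff >> (8 * i)) & 0xff) {
			if (found != -1)
				return -1;	/* more than one byte differs */
			found = i;
		}
	}

	if (found != -1 && new_byte)
		*new_byte = (uint8_t)(observed >> (8 * found));

	return found;
}
```

For the word ff00aa00f0adcc22 this reports byte 3 with value 0xf0; for ff00aa00deadccf0 it is byte 0, and for fff0aa00deadcc22 byte 6. A single stray byte at varying offsets points at a one-byte write to a wrong address rather than at a block copy gone astray.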
* [PATCH 1/1] x86: fix text_poke 2008-04-25 9:13 ` David Miller 2008-04-25 12:15 ` Zdenek Kabelac @ 2008-04-28 0:51 ` Jiri Slaby 2008-04-25 15:03 ` Linus Torvalds 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-28 0:51 UTC (permalink / raw) To: David Miller Cc: torvalds, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Jiri Slaby, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar

David Miller <davem@davemloft.net> wrote:
> From: Jiri Slaby <jirislaby@gmail.com>
> Date: Fri, 25 Apr 2008 19:10:37 +0200
>
> > I'm going to bisect it, will be back in few hours ;).
>
> Thanks for all of this hard work and investigation Jiri!

Thanks. Bisected mm down to git-x86.patch, bisected git-x86-latest down to
x86: enhance DEBUG_RODATA support - alternatives
The patch below fixes the problem for me. Comments welcome.

The 0xf0 pattern comes from alternatives_smp_lock:
text_poke(*ptr, ((unsigned char []){0xf0}), 1);
I grepped for it a long time ago, but not in the form of a compound literal :/.
*Never* more :).
--
kernel_text_address returns true even for modules, which is not wanted
in text_poke. Use core_kernel_text instead.

This is a regression introduced in e587cadd8f47e202a30712e2906a65a0606d5865
which caused occasional crashes after suspend/resume.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: pageexec@freemail.hu
CC: H. Peter Anvin <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/alternative.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5412fd7..0b074cb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -515,7 +515,7 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 	BUG_ON(len > sizeof(long));
 	BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
 		- ((long)addr & ~(sizeof(long) - 1)));
-	if (kernel_text_address((unsigned long)addr)) {
+	if (core_kernel_text((unsigned long)addr)) {
 		struct page *pages[2] = { virt_to_page(addr),
 			virt_to_page(addr + PAGE_SIZE) };
 		if (!pages[1])
--
1.5.4.5

^ permalink raw reply related [flat|nested] 183+ messages in thread
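The one-line change in the patch above is easier to see with a toy model of the two predicates. The sketch below is plain user-space C, not kernel code: the address ranges are invented, and the toy_* functions only mimic the behaviour described in the commit message (kernel_text_address accepts module text as well, core_kernel_text does not). What text_poke does on its other branch is not shown in the diff, so the model only records which branch is taken:

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented address ranges, for illustration only. */
#define TOY_STEXT	UINT64_C(0xffffffff80200000)	/* start of core kernel text */
#define TOY_ETEXT	UINT64_C(0xffffffff80600000)	/* end of core kernel text */
#define TOY_MOD_START	UINT64_C(0xffffffffa0000000)	/* start of a module's text */
#define TOY_MOD_END	UINT64_C(0xffffffffa0100000)	/* end of a module's text */

/* Model of core_kernel_text(): only the kernel image's own text. */
static bool toy_core_kernel_text(uint64_t addr)
{
	return addr >= TOY_STEXT && addr < TOY_ETEXT;
}

/* Model of a module text lookup. */
static bool toy_module_text(uint64_t addr)
{
	return addr >= TOY_MOD_START && addr < TOY_MOD_END;
}

/* Model of kernel_text_address(): core text or module text. */
static bool toy_kernel_text_address(uint64_t addr)
{
	return toy_core_kernel_text(addr) || toy_module_text(addr);
}

enum poke_path {
	POKE_VIA_VIRT_TO_PAGE,	/* the branch shown in the diff */
	POKE_OTHER		/* the branch not shown in the diff */
};

/* Which branch the text_poke() check selects, before and after the patch. */
static enum poke_path toy_poke_path(uint64_t addr, bool patched)
{
	bool via_pages = patched ? toy_core_kernel_text(addr)
				 : toy_kernel_text_address(addr);

	return via_pages ? POKE_VIA_VIRT_TO_PAGE : POKE_OTHER;
}
```

In this model an address inside a module takes the virt_to_page() branch before the patch and the other branch after it, while core kernel text is unaffected. Since module text is not part of the kernel's linear mapping, virt_to_page() on such an address presumably yields an unrelated page, which would explain a 0xf0 byte landing in arbitrary memory.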
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-28 0:51 ` [PATCH 1/1] x86: fix text_poke Jiri Slaby @ 2008-04-25 15:03 ` Linus Torvalds 2008-04-25 15:17 ` Andi Kleen ` (2 more replies) 0 siblings, 3 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 15:03 UTC (permalink / raw) To: Jiri Slaby Cc: David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar On Mon, 28 Apr 2008, Jiri Slaby wrote: > > Thanks. Bisected mm down to git-x86.patch, bisected git-x86-latest down to > x86: enhance DEBUG_RODATA support - alternatives > The patch below fixes the problem for me. Comments welcome. You're a hero, Jiri. And that also explains why I didn't see it - I don't do modules. Thanks a heap. > The 0xf0 pattern comes from alternatives_smp_lock: > text_poke(*ptr, ((unsigned char []){0xf0}), 1); And we should really add a lot more sanity checking there. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:03 ` Linus Torvalds @ 2008-04-25 15:17 ` Andi Kleen 2008-04-25 19:36 ` Christoph Lameter 2008-04-25 15:19 ` [PATCH 1/1] x86: fix text_poke Ingo Molnar 2008-04-25 20:18 ` David Miller 2 siblings, 1 reply; 183+ messages in thread From: Andi Kleen @ 2008-04-25 15:17 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar > And we should really add a lot more sanity checking there. A debug mode for virt_to_page(), __pa, __va et al. would probably make sense and would have caught it. I used to have that partly in the x86-64 port with VIRTUAL_BUG_ON. -Andi ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:17 ` Andi Kleen @ 2008-04-25 19:36 ` Christoph Lameter 2008-04-26 9:59 ` Andi Kleen 0 siblings, 1 reply; 183+ messages in thread From: Christoph Lameter @ 2008-04-25 19:36 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar On Fri, 25 Apr 2008, Andi Kleen wrote: > > And we should really add a lot more sanity checking there. > > A debug mode for virt_to_page(),__pa,__va et.al. would probably make sense > and would have caught it. > > I used to have that partly in the x86-64 port with VIRTUAL_BUG_ON. Good idea! Do you have a patch? ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 19:36 ` Christoph Lameter @ 2008-04-26 9:59 ` Andi Kleen 2008-04-26 11:16 ` Jiri Slaby 2008-04-28 20:24 ` VIRTUAL_BUG_ON() Christoph Lameter 0 siblings, 2 replies; 183+ messages in thread From: Andi Kleen @ 2008-04-26 9:59 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar

On Fri, Apr 25, 2008 at 12:36:33PM -0700, Christoph Lameter wrote:
> On Fri, 25 Apr 2008, Andi Kleen wrote:
>
> > > And we should really add a lot more sanity checking there.
> >
> > A debug mode for virt_to_page(),__pa,__va et.al. would probably make sense
> > and would have caught it.
> >
> > I used to have that partly in the x86-64 port with VIRTUAL_BUG_ON.
>
> Good idea! Do you have a patch?

Yes. Appended. But it just enables the old NUMA VIRTUAL_BUG_ON()s; more
work could be done, e.g. by instrumenting pa/va and the non-NUMA and i386
cases too.

-Andi

---

Add CONFIG option to enable VIRTUAL_BUG_ON()

VIRTUAL_BUG_ON was used in the early days of x86-64 NUMA to debug the
virtual address to struct page code. Later it was turned into a no-op,
but the calls were kept intact. Add a CONFIG option to enable it as a
BUG_ON again.

This would likely have caught the recent text_poke bug.

Signed-off-by: Andi Kleen <andi@firstfloor.org>

Index: linux/arch/x86/Kconfig.debug
===================================================================
--- linux.orig/arch/x86/Kconfig.debug
+++ linux/arch/x86/Kconfig.debug
@@ -245,4 +245,11 @@ config CPA_DEBUG
 	help
 	  Do change_page_attr() self-tests every 30 seconds.
 
+config DEBUG_VIRTUAL
+	bool "Virtual memory translation debugging"
+	depends on DEBUG_KERNEL && NUMA && X86_64
+	help
+	  Enable some costly sanity checks in the NUMA virtual to page
+	  code. This can catch mistakes with virt_to_page() and friends.
+
 endmenu
Index: linux/include/asm-x86/mmzone_64.h
===================================================================
--- linux.orig/include/asm-x86/mmzone_64.h
+++ linux/include/asm-x86/mmzone_64.h
@@ -7,7 +7,11 @@
 
 #ifdef CONFIG_NUMA
 
+#ifdef CONFIG_DEBUG_VIRTUAL
+#define VIRTUAL_BUG_ON(x) BUG_ON(x)
+#else
 #define VIRTUAL_BUG_ON(x)
+#endif
 
 #include <asm/smp.h>

^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-26 9:59 ` Andi Kleen @ 2008-04-26 11:16 ` Jiri Slaby 2008-04-26 11:34 ` Andi Kleen 2008-04-28 20:24 ` VIRTUAL_BUG_ON() Christoph Lameter 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-26 11:16 UTC (permalink / raw) To: Andi Kleen Cc: Christoph Lameter, Linus Torvalds, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar On 04/26/2008 11:59 AM, Andi Kleen wrote: > On Fri, Apr 25, 2008 at 12:36:33PM -0700, Christoph Lameter wrote: >> On Fri, 25 Apr 2008, Andi Kleen wrote: >> >>>> And we should really add a lot more sanity checking there. >>> A debug mode for virt_to_page(),__pa,__va et.al. would probably make sense >>> and would have caught it. >>> >>> I used to have that partly in the x86-64 port with VIRTUAL_BUG_ON. >> Good idea! Do you have a patch? > > Yes. Appended. But it just enables the old NUMA VIRTUAL_BUG_ON()s, more > work could be done e.g. by instrumenting pa/va and the non NUMA and i386 > case too. Is anybody working on that? I would volunteer to do it. > --- linux.orig/include/asm-x86/mmzone_64.h > +++ linux/include/asm-x86/mmzone_64.h > @@ -7,7 +7,11 @@ > > #ifdef CONFIG_NUMA > > +#ifdef CONFIG_DEBUG_VIRTUAL > +#define VIRTUAL_BUG_ON(x) BUG_ON(x) > +#else > #define VIRTUAL_BUG_ON(x) > +#endif > > #include <asm/smp.h> ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-26 11:16 ` Jiri Slaby @ 2008-04-26 11:34 ` Andi Kleen 0 siblings, 0 replies; 183+ messages in thread From: Andi Kleen @ 2008-04-26 11:34 UTC (permalink / raw) To: Jiri Slaby Cc: Andi Kleen, Christoph Lameter, Linus Torvalds, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar > Is anybody working on that? I would volunteer to do it. Feel free to take it. -Andi ^ permalink raw reply [flat|nested] 183+ messages in thread
* VIRTUAL_BUG_ON() 2008-04-26 9:59 ` Andi Kleen 2008-04-26 11:16 ` Jiri Slaby @ 2008-04-28 20:24 ` Christoph Lameter 2008-05-01 19:22 ` [RFC 1/1] mm: add virt to phys debug Jiri Slaby 1 sibling, 1 reply; 183+ messages in thread From: Christoph Lameter @ 2008-04-28 20:24 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar On Sat, 26 Apr 2008, Andi Kleen wrote: > > Good idea! Do you have a patch? > > Yes. Appended. But it just enables the old NUMA VIRTUAL_BUG_ON()s, more > work could be done e.g. by instrumenting pa/va and the non NUMA and i386 > case too. Hmmmm.. No hooks yet? I have some pieces here that do something similar: Subject: Add checks for virtual addresses Add checks to insure that virtual addresses are not used in invalid contexts. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- arch/x86/mm/ioremap.c | 1 + include/asm-x86/page_32.h | 7 ++++++- 2 files changed, 7 insertions(+), 1 deletion(-) Index: linux-2.6.25-mm1/arch/x86/mm/ioremap.c =================================================================== --- linux-2.6.25-mm1.orig/arch/x86/mm/ioremap.c 2008-04-25 23:17:31.872390404 -0700 +++ linux-2.6.25-mm1/arch/x86/mm/ioremap.c 2008-04-25 23:37:43.202391820 -0700 @@ -25,6 +25,7 @@ unsigned long __phys_addr(unsigned long x) { + VM_BUG_ON(is_vmalloc_addr((void *)x)); if (x >= __START_KERNEL_map) return x - __START_KERNEL_map + phys_base; return x - PAGE_OFFSET; Index: linux-2.6.25-mm1/include/asm-x86/page_32.h =================================================================== --- linux-2.6.25-mm1.orig/include/asm-x86/page_32.h 2008-04-25 23:17:31.882389317 -0700 +++ linux-2.6.25-mm1/include/asm-x86/page_32.h 2008-04-25 23:37:43.202391820 -0700 @@ -64,7 +64,12 @@ typedef struct page *pgtable_t; #endif #ifndef __ASSEMBLY__ -#define __phys_addr(x) ((x) - 
PAGE_OFFSET) +static inline unsigned long __phys_addr(unsigned long x) +{ + VM_BUG_ON(is_vmalloc_addr((void *)x)); + return x - PAGE_OFFSET; +} + #define __phys_reloc_hide(x) RELOC_HIDE((x), 0) #ifdef CONFIG_FLATMEM ^ permalink raw reply [flat|nested] 183+ messages in thread
* [RFC 1/1] mm: add virt to phys debug 2008-04-28 20:24 ` VIRTUAL_BUG_ON() Christoph Lameter @ 2008-05-01 19:22 ` Jiri Slaby 2008-05-01 20:18 ` Christoph Lameter 0 siblings, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-05-01 19:22 UTC (permalink / raw) To: Christoph Lameter Cc: linux-mm, Ingo Molnar, H. Peter Anvin, Jeremy Fitzhardinge, pageexec, Mathieu Desnoyers, herbert, penberg, akpm, linux-ext4, paulmck, rjw, zdenek.kabelac, David Miller, Linus Torvalds, linux-kernel, Jiri Slaby, Andi Kleen, Christoph Lameter

Christoph Lameter wrote:
> --- linux-2.6.25-mm1.orig/include/asm-x86/page_32.h 2008-04-25 23:17:31.882389317 -0700
> +++ linux-2.6.25-mm1/include/asm-x86/page_32.h 2008-04-25 23:37:43.202391820 -0700
> @@ -64,8 +64,13 @@ typedef struct page *pgtable_t;
>  #endif
>  #ifndef __ASSEMBLY__
> -#define __phys_addr(x)		((x) - PAGE_OFFSET)
> +static inline unsigned long __phys_addr(unsigned long x)
> +{
> +	VM_BUG_ON(is_vmalloc_addr((void *)x));
> +	return x - PAGE_OFFSET;
> +}
> +
>  #define __phys_reloc_hide(x)	RELOC_HIDE((x), 0)
>  #ifdef CONFIG_FLATMEM

Christoph, were you able to compile this somehow? I had to move the code into ioremap.c, alongside the 64-bit variant, to allow the checking.
A patch which I created is attached. I've successfully tested it with this module:

static int init1(void)
{
	static int data;
	struct module *mod = THIS_MODULE;
	char *k = (void *)PAGE_OFFSET;
	char *m = mod->module_core;
	char *sl = kmalloc(1000, GFP_KERNEL);
	char *pg = (void *)__get_free_page(GFP_KERNEL);
	char *rnd;

	/* Translations that are expected to pass the checks. */
	printk(KERN_WARNING "OK\n");
	printk(KERN_WARNING "%p -> %lx\n", &data, vmalloc_to_pfn(&data));
	printk(KERN_WARNING "%p -> %lx\n", m, vmalloc_to_pfn(m));
	printk(KERN_WARNING "%p -> %lx\n", k, virt_to_phys(k));
	printk(KERN_WARNING "%p -> %lx\n", sl, virt_to_phys(sl));
	printk(KERN_WARNING "%p -> %lx\n", pg, virt_to_phys(pg));

	/* Translations that are expected to trip the new checks. */
	printk(KERN_WARNING "failing\n");
	printk(KERN_WARNING "%p -> %lx\n", &data, virt_to_phys(&data));
	printk(KERN_WARNING "%p -> %lx\n", m, virt_to_phys(m));
	printk(KERN_WARNING "%p -> %lx\n", k, vmalloc_to_pfn(k));
	printk(KERN_WARNING "%p -> %lx\n", sl, vmalloc_to_pfn(sl));
	printk(KERN_WARNING "%p -> %lx\n", pg, vmalloc_to_pfn(pg));
#ifdef CONFIG_X86_64
	rnd = (void *)0xffffc10000000000;
	printk(KERN_WARNING "%p -> %lx\n", rnd, vmalloc_to_pfn(rnd));
	printk(KERN_WARNING "%p -> %lx\n", rnd, virt_to_phys(rnd));
	rnd = (void *)0xffff800000000000;
	printk(KERN_WARNING "%p -> %lx\n", rnd, vmalloc_to_pfn(rnd));
	printk(KERN_WARNING "%p -> %lx\n", rnd, virt_to_phys(rnd));
	rnd = (void *)0xffffe2ffffffffff + 1;
	printk(KERN_WARNING "%p -> %lx\n", rnd, vmalloc_to_pfn(rnd));
	printk(KERN_WARNING "%p -> %lx\n", rnd, virt_to_phys(rnd));
	rnd = (void *)0xffffe20000000000;
	printk(KERN_WARNING "%p -> %lx\n", rnd, virt_to_phys(rnd));
#endif
	kfree(sl);
	free_page((ulong)pg);
	return -EIO;
}

Please comment. (In particular, on whether to leave 2 debug macros or only a single one.)

--

Add some (configurable) expensive sanity checking to catch wrong address translations on x86.
- create linux/mmdebug.h file to be able include this file in asm headers to not get unsolvable loops in header files - __phys_addr on x86_32 became a function in ioremap.c since PAGE_OFFSET and is_vmalloc_addr is undefined if declared in page_32.h (again circular dependencies) - add __phys_addr_const for initializing doublefault_tss.__cr3 Tested on 386, 386pae, x86_64 and x86_64 numa=fake=2. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <clameter@sgi.com> --- arch/x86/Kconfig.debug | 7 ------- arch/x86/kernel/doublefault_32.c | 2 +- arch/x86/mm/ioremap.c | 31 ++++++++++++++++++++++++------- include/asm-x86/mmzone_64.h | 6 +----- include/asm-x86/page_32.h | 3 ++- include/linux/mm.h | 7 +------ include/linux/mmdebug.h | 18 ++++++++++++++++++ lib/Kconfig.debug | 9 +++++++++ mm/vmalloc.c | 5 +++++ 9 files changed, 61 insertions(+), 27 deletions(-) create mode 100644 include/linux/mmdebug.h diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug index 33b4388..6396ee0 100644 --- a/arch/x86/Kconfig.debug +++ b/arch/x86/Kconfig.debug @@ -258,13 +258,6 @@ config CPA_DEBUG help Do change_page_attr() self-tests every 30 seconds. -config DEBUG_VIRTUAL - bool "Virtual memory translation debugging" - depends on DEBUG_KERNEL && NUMA && X86_64 - help - Enable some costly sanity checks in the NUMA virtual to page - code. This can catch mistakes with virt_to_page() and friends. 
- endmenu config OPTIMIZE_INLINING diff --git a/arch/x86/kernel/doublefault_32.c b/arch/x86/kernel/doublefault_32.c index a47798b..395acb1 100644 --- a/arch/x86/kernel/doublefault_32.c +++ b/arch/x86/kernel/doublefault_32.c @@ -66,6 +66,6 @@ struct tss_struct doublefault_tss __cacheline_aligned = { .ds = __USER_DS, .fs = __KERNEL_PERCPU, - .__cr3 = __pa(swapper_pg_dir) + .__cr3 = __phys_addr_const((unsigned long)swapper_pg_dir) } }; diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 6d96353..5ead5a8 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -23,18 +23,26 @@ #ifdef CONFIG_X86_64 -unsigned long __phys_addr(unsigned long x) +static inline int phys_addr_valid(unsigned long addr) { - if (x >= __START_KERNEL_map) - return x - __START_KERNEL_map + phys_base; - return x - PAGE_OFFSET; + return addr < (1UL << boot_cpu_data.x86_phys_bits); } -EXPORT_SYMBOL(__phys_addr); -static inline int phys_addr_valid(unsigned long addr) +unsigned long __phys_addr(unsigned long x) { - return addr < (1UL << boot_cpu_data.x86_phys_bits); + if (x >= __START_KERNEL_map) { + x -= __START_KERNEL_map; + VIRTUAL_BUG_ON(x >= KERNEL_IMAGE_SIZE); + x += phys_base; + } else { + VIRTUAL_BUG_ON(x < PAGE_OFFSET); + x -= PAGE_OFFSET; + VIRTUAL_BUG_ON(system_state == SYSTEM_BOOTING ? 
x > MAXMEM : + !phys_addr_valid(x)); + } + return x; } +EXPORT_SYMBOL(__phys_addr); #else @@ -43,6 +51,15 @@ static inline int phys_addr_valid(unsigned long addr) return 1; } +unsigned long __phys_addr(unsigned long x) +{ + /* VMALLOC_* aren't constants; not available at the boot time */ + VIRTUAL_BUG_ON(x < PAGE_OFFSET || (system_state != SYSTEM_BOOTING && + is_vmalloc_addr((void *)x))); + return x - PAGE_OFFSET; +} +EXPORT_SYMBOL(__phys_addr); + #endif int page_is_ram(unsigned long pagenr) diff --git a/include/asm-x86/mmzone_64.h b/include/asm-x86/mmzone_64.h index 8e64d67..facde3e 100644 --- a/include/asm-x86/mmzone_64.h +++ b/include/asm-x86/mmzone_64.h @@ -7,11 +7,7 @@ #ifdef CONFIG_NUMA -#ifdef CONFIG_DEBUG_VIRTUAL -#define VIRTUAL_BUG_ON(x) BUG_ON(x) -#else -#define VIRTUAL_BUG_ON(x) -#endif +#include <linux/mmdebug.h> #include <asm/smp.h> diff --git a/include/asm-x86/page_32.h b/include/asm-x86/page_32.h index 424e82f..9159bfb 100644 --- a/include/asm-x86/page_32.h +++ b/include/asm-x86/page_32.h @@ -64,7 +64,8 @@ typedef struct page *pgtable_t; #endif #ifndef __ASSEMBLY__ -#define __phys_addr(x) ((x) - PAGE_OFFSET) +#define __phys_addr_const(x) ((x) - PAGE_OFFSET) +extern unsigned long __phys_addr(unsigned long); #define __phys_reloc_hide(x) RELOC_HIDE((x), 0) #ifdef CONFIG_FLATMEM diff --git a/include/linux/mm.h b/include/linux/mm.h index 438ee65..5e002dc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -7,6 +7,7 @@ #include <linux/gfp.h> #include <linux/list.h> +#include <linux/mmdebug.h> #include <linux/mmzone.h> #include <linux/rbtree.h> #include <linux/prio_tree.h> @@ -210,12 +211,6 @@ struct inode; */ #include <linux/page-flags.h> -#ifdef CONFIG_DEBUG_VM -#define VM_BUG_ON(cond) BUG_ON(cond) -#else -#define VM_BUG_ON(condition) do { } while(0) -#endif - /* * Methods to modify the page usage count. 
* diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h new file mode 100644 index 0000000..860ed1a --- /dev/null +++ b/include/linux/mmdebug.h @@ -0,0 +1,18 @@ +#ifndef LINUX_MM_DEBUG_H +#define LINUX_MM_DEBUG_H 1 + +#include <linux/autoconf.h> + +#ifdef CONFIG_DEBUG_VM +#define VM_BUG_ON(cond) BUG_ON(cond) +#else +#define VM_BUG_ON(cond) do { } while(0) +#endif + +#ifdef CONFIG_DEBUG_VIRTUAL +#define VIRTUAL_BUG_ON(cond) BUG_ON(cond) +#else +#define VIRTUAL_BUG_ON(cond) do { } while(0) +#endif + +#endif diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index f75f6c1..eb643cb 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -472,6 +472,15 @@ config DEBUG_VM If unsure, say N. +config DEBUG_VIRTUAL + bool "Debug VM translations" + depends on DEBUG_KERNEL && X86 + help + Enable some costly sanity checks in virtual to page code. This can + catch mistakes with virt_to_page() and friends. + + If unsure, say N. + config DEBUG_WRITECOUNT bool "Debug filesystem writers count" depends on DEBUG_KERNEL diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 2a39cf1..c8172db 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -180,6 +180,11 @@ struct page *vmalloc_to_page(const void *vmalloc_addr) pmd_t *pmd; pte_t *ptep, pte; + /* XXX we might need to change this if we add VIRTUAL_BUG_ON for + * architectures that do not vmalloc module space */ + VIRTUAL_BUG_ON(!is_vmalloc_addr(vmalloc_addr) && + !is_module_address(addr)); + if (!pgd_none(*pgd)) { pud = pud_offset(pgd, addr); if (!pud_none(*pud)) { -- 1.5.4.5 ^ permalink raw reply related [flat|nested] 183+ messages in thread
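The range checks that the patch above adds to the 64-bit `__phys_addr()` can be sketched in userspace roughly as follows. This is only an illustration: the `SKETCH_*` constants are assumed values chosen for the example (the kernel takes them from its own headers and from `boot_cpu_data`), and the function reports failure through its return value where the kernel would hit `VIRTUAL_BUG_ON()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout constants (assumptions for this sketch only). */
#define SKETCH_START_KERNEL_MAP  0xffffffff80000000ULL /* kernel image mapping */
#define SKETCH_PAGE_OFFSET       0xffff810000000000ULL /* direct mapping base */
#define SKETCH_KERNEL_IMAGE_SIZE (512ULL << 20)
#define SKETCH_PHYS_BITS         40

/*
 * Translate a virtual address to a physical one, rejecting addresses
 * that belong to neither the kernel image mapping nor the direct mapping.
 * Returns true and stores the result in *phys on success.
 */
static bool sketch_phys_addr(uint64_t x, uint64_t phys_base, uint64_t *phys)
{
	if (x >= SKETCH_START_KERNEL_MAP) {
		x -= SKETCH_START_KERNEL_MAP;
		if (x >= SKETCH_KERNEL_IMAGE_SIZE)
			return false;	/* past the kernel image, e.g. module space */
		*phys = x + phys_base;
		return true;
	}

	if (x < SKETCH_PAGE_OFFSET)
		return false;		/* below the direct mapping */
	x -= SKETCH_PAGE_OFFSET;
	if (x >= (1ULL << SKETCH_PHYS_BITS))
		return false;		/* not a possible physical address */
	*phys = x;
	return true;
}
```

With these assumed constants, an address such as 0xffffc20000000000 lies above the direct mapping base but is rejected, because subtracting the base leaves a value wider than the assumed physical address width. That is the kind of stray vmalloc-range pointer the checks are meant to catch.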
* Re: [RFC 1/1] mm: add virt to phys debug 2008-05-01 19:22 ` [RFC 1/1] mm: add virt to phys debug Jiri Slaby @ 2008-05-01 20:18 ` Christoph Lameter 2008-05-06 21:54 ` Jiri Slaby 2008-05-13 14:38 ` Jiri Slaby 0 siblings, 2 replies; 183+ messages in thread From: Christoph Lameter @ 2008-05-01 20:18 UTC (permalink / raw) To: Jiri Slaby Cc: linux-mm, Ingo Molnar, H. Peter Anvin, Jeremy Fitzhardinge, pageexec, Mathieu Desnoyers, herbert, penberg, akpm, linux-ext4, paulmck, rjw, zdenek.kabelac, David Miller, Linus Torvalds, linux-kernel, Andi Kleen

On Thu, 1 May 2008, Jiri Slaby wrote:

> Christoph, were you able to compile this somehow? I had to move the code
> into ioremap.c, alongside the 64-bit variant, to allow the checking.

The 64-bit piece works fine here and I used it for debugging the vmalloc work. Not sure about the 32-bit piece.

> A patch which I created is attached. I've successfully tested it with this
> module:

Great! Someone else is picking this up. You can probably do a more thorough job than I can.

> Add some (configurable) expensive sanity checking to catch wrong address
> translations on x86.
>
> - create linux/mmdebug.h file to be able include this file in
>   asm headers to not get unsolvable loops in header files
> - __phys_addr on x86_32 became a function in ioremap.c since
>   PAGE_OFFSET and is_vmalloc_addr is undefined if declared in
>   page_32.h (again circular dependencies)
> - add __phys_addr_const for initializing doublefault_tss.__cr3

Hmmm.. We could use include/linux/bounds.h to make VMALLOC_START/VMALLOC_END (or whatever you need for checking the memory boundaries) a cpp constant which may allow the use in page_32.h without circular dependencies.

^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [RFC 1/1] mm: add virt to phys debug 2008-05-01 20:18 ` Christoph Lameter @ 2008-05-06 21:54 ` Jiri Slaby 2008-05-07 17:30 ` Christoph Lameter 2008-05-13 14:38 ` Jiri Slaby 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-05-06 21:54 UTC (permalink / raw) To: Christoph Lameter Cc: linux-mm, Ingo Molnar, H. Peter Anvin, Jeremy Fitzhardinge, pageexec, Mathieu Desnoyers, herbert, penberg, akpm, linux-ext4, paulmck, rjw, zdenek.kabelac, David Miller, Linus Torvalds, linux-kernel, Andi Kleen On 05/01/2008 10:18 PM, Christoph Lameter wrote: > On Thu, 1 May 2008, Jiri Slaby wrote: >> Add some (configurable) expensive sanity checking to catch wrong address >> translations on x86. >> >> - create linux/mmdebug.h file to be able include this file in >> asm headers to not get unsolvable loops in header files >> - __phys_addr on x86_32 became a function in ioremap.c since >> PAGE_OFFSET and is_vmalloc_addr is undefined if declared in >> page_32.h (again circular dependencies) >> - add __phys_addr_const for initializing doublefault_tss.__cr3 > > Hmmm.. We could use include/linux/bounds.h to make > VMALLOC_START/VMALLOC_END (or whatever you need for checking the memory > boundaries) a cpp constant which may allow the use in page_32.h without > circular dependencies. I like the idea, I'll get back with a patch in few days (sorry, too busy). Anyway bounds.h should be include/asm/ thing though. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [RFC 1/1] mm: add virt to phys debug 2008-05-06 21:54 ` Jiri Slaby @ 2008-05-07 17:30 ` Christoph Lameter 0 siblings, 0 replies; 183+ messages in thread From: Christoph Lameter @ 2008-05-07 17:30 UTC (permalink / raw) To: Jiri Slaby Cc: linux-mm, Ingo Molnar, H. Peter Anvin, Jeremy Fitzhardinge, pageexec, Mathieu Desnoyers, herbert, penberg, akpm, linux-ext4, paulmck, rjw, zdenek.kabelac, David Miller, Linus Torvalds, linux-kernel, Andi Kleen On Tue, 6 May 2008, Jiri Slaby wrote: > I like the idea, I'll get back with a patch in few days (sorry, too busy). > Anyway bounds.h should be include/asm/ thing though. For arch specific stuff use asm-offsets.h. It would have to be included in page_xx.h. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [RFC 1/1] mm: add virt to phys debug 2008-05-01 20:18 ` Christoph Lameter 2008-05-06 21:54 ` Jiri Slaby @ 2008-05-13 14:38 ` Jiri Slaby 1 sibling, 0 replies; 183+ messages in thread From: Jiri Slaby @ 2008-05-13 14:38 UTC (permalink / raw) To: Christoph Lameter Cc: linux-mm, Ingo Molnar, H. Peter Anvin, Jeremy Fitzhardinge, pageexec, Mathieu Desnoyers, herbert, penberg, akpm, linux-ext4, paulmck, linux-kernel, Andi Kleen

Christoph Lameter wrote:
> On Thu, 1 May 2008, Jiri Slaby wrote:
>> Add some (configurable) expensive sanity checking to catch wrong address
>> translations on x86.
>>
>> - create linux/mmdebug.h file to be able include this file in
>>   asm headers to not get unsolvable loops in header files
>> - __phys_addr on x86_32 became a function in ioremap.c since
>>   PAGE_OFFSET and is_vmalloc_addr is undefined if declared in
>>   page_32.h (again circular dependencies)
>> - add __phys_addr_const for initializing doublefault_tss.__cr3
>
> Hmmm.. We could use include/linux/bounds.h to make
> VMALLOC_START/VMALLOC_END (or whatever you need for checking the memory
> boundaries) a cpp constant which may allow the use in page_32.h without
> circular dependencies.

Hrm, not that easy. I ended up splitting fixmap_32.h (the VMALLOC constants depend on it on 32-bit) and moving constants around from all over the tree (NR_CPUS, FIX_ACPI_PAGES...) to avoid including files which would create loops, but I still did not have e.g. PMD_MASK available on all configurations. I think it's not worth it.

Objections to merging the patch as it was (http://lkml.org/lkml/2008/5/1/300)?

Thanks.

^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:03 ` Linus Torvalds 2008-04-25 15:17 ` Andi Kleen @ 2008-04-25 15:19 ` Ingo Molnar 2008-04-25 15:26 ` Ingo Molnar 2008-04-25 15:27 ` Andi Kleen 2008-04-25 20:18 ` David Miller 2 siblings, 2 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 15:19 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 28 Apr 2008, Jiri Slaby wrote: > > > > Thanks. Bisected mm down to git-x86.patch, bisected git-x86-latest > > down to x86: enhance DEBUG_RODATA support - alternatives The patch > > below fixes the problem for me. Comments welcome. > > You're a hero, Jiri. indeed! > And that also explains why I didn't see it - I don't do modules. neither does my auto-test :-/ Suspend/resume goes from SMP to UP and then back - and triggers all the instrument patching code. I suspect we should/could have seen similar problems with a pure CPU hotplug stress-test, on a modular kernel. > Thanks a heap. > > > The 0xf0 pattern comes from alternatives_smp_lock: text_poke(*ptr, > > ((unsigned char []){0xf0}), 1); > > And we should really add a lot more sanity checking there. yeah. incidentally, this bug was fixed by Mathieu yesterday but the full impact of the bug was not realized. Below is that patch from sched-devel. i'm wondering what the best sanity checking would be. What we want is to be sure the patch we modify is truly a kernel or module text page. Perhaps we should start marking all kernel/module text pages with PageReserved? That way we can not corrupt any userspace/pagecache page. 
(and we'd clear PageReserved on module unload) Ingo -------------------------> Subject: Fix sched-devel text_poke From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Date: Thu, 24 Apr 2008 11:03:33 -0400 Use core_text_address() instead of kernel_text_address(). Deal with modules in the same way used for the core kernel. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/alternative.c | 38 ++++++++++++++++++-------------------- 1 file changed, 18 insertions(+), 20 deletions(-) Index: linux/arch/x86/kernel/alternative.c =================================================================== --- linux.orig/arch/x86/kernel/alternative.c +++ linux/arch/x86/kernel/alternative.c @@ -511,31 +511,29 @@ void *__kprobes text_poke(void *addr, co unsigned long flags; char *vaddr; int nr_pages = 2; + struct page *pages[2]; + int i; - BUG_ON(len > sizeof(long)); - BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1)) - - ((long)addr & ~(sizeof(long) - 1))); - if (kernel_text_address((unsigned long)addr)) { - struct page *pages[2] = { virt_to_page(addr), - virt_to_page(addr + PAGE_SIZE) }; - if (!pages[1]) - nr_pages = 1; - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); - BUG_ON(!vaddr); - local_irq_save(flags); - memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); - local_irq_restore(flags); - vunmap(vaddr); + if (!core_kernel_text((unsigned long)addr)) { + pages[0] = vmalloc_to_page(addr); + pages[1] = vmalloc_to_page(addr + PAGE_SIZE); } else { - /* - * modules are in vmalloc'ed memory, always writable. 
- */ - local_irq_save(flags); - memcpy(addr, opcode, len); - local_irq_restore(flags); + pages[0] = virt_to_page(addr); + pages[1] = virt_to_page(addr + PAGE_SIZE); } + BUG_ON(!pages[0]); + if (!pages[1]) + nr_pages = 1; + vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); + BUG_ON(!vaddr); + local_irq_save(flags); + memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); + local_irq_restore(flags); + vunmap(vaddr); sync_core(); /* Could also do a CLFLUSH here to speed up CPU recovery; but that causes hangs on some VIA CPUs. */ + for (i = 0; i < len; i++) + BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]); return addr; } ^ permalink raw reply [flat|nested] 183+ messages in thread
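The idea behind the patch above (write through a temporary writable alias of the page, then verify through the original address) can be loosely imitated in userspace. The sketch below is an analogy only, not kernel code: it assumes a POSIX system, uses two `mmap()` views of one scratch file in place of `vmap()`, and the helper name `poke_through_alias` is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * "text" is a read-only mapping of a scratch file. The patch byte is
 * written through a second, temporary, writable mapping of the same page.
 * Returns the byte seen through the read-only mapping afterwards,
 * or -1 on any setup failure.
 */
static int poke_through_alias(size_t offset, unsigned char opcode)
{
	long page = sysconf(_SC_PAGESIZE);
	FILE *backing;
	unsigned char *text, *alias;
	int fd, result = -1;

	if (page <= 0 || offset >= (size_t)page)
		return -1;
	backing = tmpfile();
	if (!backing)
		return -1;
	fd = fileno(backing);
	if (ftruncate(fd, page) != 0)
		goto out_close;

	/* The mapping that plays the role of write-protected kernel text. */
	text = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	if (text == MAP_FAILED)
		goto out_close;

	/* The temporary writable alias, analogous to the vmap()ed view. */
	alias = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (alias == MAP_FAILED)
		goto out_unmap_text;

	memcpy(alias + offset, &opcode, 1);
	munmap(alias, (size_t)page);

	/* Verify through the original mapping, as the patch's final loop does. */
	result = text[offset];

out_unmap_text:
	munmap(text, (size_t)page);
out_close:
	fclose(backing);
	return result;
}
```

For example, `poke_through_alias(16, 0xf0)` writes a single lock-prefix byte through the alias and returns what the read-only view then sees at the same offset.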
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:19 ` [PATCH 1/1] x86: fix text_poke Ingo Molnar @ 2008-04-25 15:26 ` Ingo Molnar 2008-04-25 15:32 ` Ingo Molnar 2008-04-25 15:33 ` Linus Torvalds 2008-04-25 15:27 ` Andi Kleen 1 sibling, 2 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 15:26 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge > > > The 0xf0 pattern comes from alternatives_smp_lock: text_poke(*ptr, > > > ((unsigned char []){0xf0}), 1); > > > > And we should really add a lot more sanity checking there. something like the patch below? (untested) Ingo ---------------> Subject: harden kernel code patching From: Ingo Molnar <mingo@elte.hu> Date: Fri Apr 25 17:07:03 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/alternative.c | 5 +++++ mm/vmalloc.c | 3 +++ 2 files changed, 8 insertions(+) Index: linux/arch/x86/kernel/alternative.c =================================================================== --- linux.orig/arch/x86/kernel/alternative.c +++ linux/arch/x86/kernel/alternative.c @@ -518,6 +518,11 @@ void *__kprobes text_poke(void *addr, co if (core_kernel_text((unsigned long)addr)) { struct page *pages[2] = { virt_to_page(addr), virt_to_page(addr + PAGE_SIZE) }; + /* + * Module text pages are PageReserved: + */ + WARN_ON(pages[0] && !PageReserved(pages[0])) + WARN_ON(pages[1] && !PageReserved(pages[1])) if (!pages[1]) nr_pages = 1; vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); Index: linux/mm/vmalloc.c =================================================================== --- linux.orig/mm/vmalloc.c +++ linux/mm/vmalloc.c @@ -391,6 +391,7 @@ static void __vunmap(const void *addr, i struct page *page = area->pages[i]; BUG_ON(!page); + ClearPageReserved(page); __free_page(page); } @@ -507,6 +508,8 @@ static void 
*__vmalloc_area_node(struct area->nr_pages = i; goto fail; } + if (prot == PAGE_KERNEL_EXEC) + SetPageReserved(page); area->pages[i] = page; } ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:26 ` Ingo Molnar @ 2008-04-25 15:32 ` Ingo Molnar 2008-04-25 15:33 ` Linus Torvalds 1 sibling, 0 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 15:32 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Ingo Molnar <mingo@elte.hu> wrote: > > > > The 0xf0 pattern comes from alternatives_smp_lock: text_poke(*ptr, > > > > ((unsigned char []){0xf0}), 1); > > > > > > And we should really add a lot more sanity checking there. > > something like the patch below? (untested) the one below even builds and boots. this assumes that all modules areas are allocated via PAGE_KERNEL_EXEC - but that is generally true on x86 due to NX. 32-bit uses vmalloc_exec(), 64-bit uses __vmalloc_area(..., PAGE_KERNEL_EXEC). Jiri ... if you have any desire/stamina to still test this code - does the patch below produce any warnings if you unapply your fix as well, during suspend/resume? 
Ingo ---------------> Subject: x86: harden kernel code patching From: Ingo Molnar <mingo@elte.hu> Date: Fri Apr 25 17:07:03 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/alternative.c | 5 +++++ mm/vmalloc.c | 3 +++ 2 files changed, 8 insertions(+) Index: linux/arch/x86/kernel/alternative.c =================================================================== --- linux.orig/arch/x86/kernel/alternative.c +++ linux/arch/x86/kernel/alternative.c @@ -518,6 +518,11 @@ void *__kprobes text_poke(void *addr, co if (core_kernel_text((unsigned long)addr)) { struct page *pages[2] = { virt_to_page(addr), virt_to_page(addr + PAGE_SIZE) }; + /* + * Module text pages are PageReserved: + */ + WARN_ON(pages[0] && !PageReserved(pages[0])); + WARN_ON(pages[1] && !PageReserved(pages[1])); if (!pages[1]) nr_pages = 1; vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); Index: linux/mm/vmalloc.c =================================================================== --- linux.orig/mm/vmalloc.c +++ linux/mm/vmalloc.c @@ -391,6 +391,7 @@ static void __vunmap(const void *addr, i struct page *page = area->pages[i]; BUG_ON(!page); + ClearPageReserved(page); __free_page(page); } @@ -507,6 +508,8 @@ static void *__vmalloc_area_node(struct area->nr_pages = i; goto fail; } + if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL_EXEC)) + SetPageReserved(page); area->pages[i] = page; } ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:26 ` Ingo Molnar 2008-04-25 15:32 ` Ingo Molnar @ 2008-04-25 15:33 ` Linus Torvalds 2008-04-25 15:48 ` Andi Kleen ` (2 more replies) 1 sibling, 3 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 15:33 UTC (permalink / raw) To: Ingo Molnar Cc: Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Ingo Molnar wrote: > > something like the patch below? (untested) No. That whole code sequence is total and utter crap. It needs to be rewritten. It first does a BUG_ON() if it's not naturally aligned (because that wouldn't be atomic), and then it has code for page crossing! What a TOTAL PIECE OF SH*T! Hint: - if it's naturally aligned, it couldn't be page crossing ANYWAY - and if it was a page-crosser, it sure as hell couldn't be atomic! The code is just crap, crap, crap. It needs to be rewritten from scratch. I'll have a patch soonish. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
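Linus's first hint can be checked mechanically. The sketch below assumes 4096-byte pages and looks only at the first three pages of an address space, which covers every offset pattern. It counts naturally aligned accesses of up to 8 bytes that would straddle a page boundary. The helper names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Natural alignment: len is a power of two and addr is a multiple of len. */
static bool naturally_aligned(uintptr_t addr, size_t len)
{
	return len != 0 && (len & (len - 1)) == 0 && (addr & (len - 1)) == 0;
}

/* True if the first and last byte of the access sit on different pages. */
static bool crosses_page(uintptr_t addr, size_t len)
{
	return addr / SKETCH_PAGE_SIZE != (addr + len - 1) / SKETCH_PAGE_SIZE;
}

/* Count naturally aligned accesses of 1..8 bytes that cross a page. */
static unsigned count_aligned_page_crossers(void)
{
	unsigned bad = 0;

	for (size_t len = 1; len <= 8; len++)
		for (uintptr_t addr = 0; addr < 3 * SKETCH_PAGE_SIZE; addr++)
			if (naturally_aligned(addr, len) && crosses_page(addr, len))
				bad++;
	return bad;
}
```

The count comes out as zero: the page size is a multiple of every power-of-two length up to 8, so an aligned access always ends before the next page starts. A 2-byte write at offset 4095 does cross, but it is not naturally aligned.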
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:33 ` Linus Torvalds @ 2008-04-25 15:48 ` Andi Kleen 2008-04-25 16:06 ` Linus Torvalds 2008-04-25 15:50 ` Ingo Molnar 2008-04-25 15:54 ` Mathieu Desnoyers 2 siblings, 1 reply; 183+ messages in thread From: Andi Kleen @ 2008-04-25 15:48 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge > - if it's naturally aligned, it couldn't be page crossing ANYWAY > - and if it was a page-crosser, it sure as hell couldn't be atomic! With the current code it doesn't need to be atomic anyways because all patching is done with other CPUs stopped, except for kprobes but those only ever write a single byte. So all these checks can be just removed. -Andi ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 15:48 ` Andi Kleen @ 2008-04-25 16:06 ` Linus Torvalds 2008-04-25 16:19 ` Andi Kleen 2008-04-25 16:22 ` Ingo Molnar 0 siblings, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 16:06 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Andi Kleen wrote: > > So all these checks can be just removed. Quite frankly, I'd rather tighten them up. All the callers actually seem to do just a single-byte one. So I'd suggest really tightening it up to require total natural alignment (rather than the weaker version that required that it fit in an aligned unsigned long or whatever). And I'd suggest using FIXMAP's instead of vmap. Maybe something like the appended (TOTALLY UNTESTED!) Linus --- arch/x86/kernel/alternative.c | 32 ++++++++++++++++---------------- include/asm-x86/fixmap_32.h | 1 + include/asm-x86/fixmap_64.h | 1 + 3 files changed, 18 insertions(+), 16 deletions(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index df4099d..6172e40 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -508,24 +508,24 @@ void *text_poke_early(void *addr, const void *opcode, size_t len) */ void *__kprobes text_poke(void *addr, const void *opcode, size_t len) { - unsigned long flags; - char *vaddr; - int nr_pages = 2; + static DEFINE_SPINLOCK(poke_lock); + unsigned long flags, bits; + bits = (unsigned long) addr; BUG_ON(len > sizeof(long)); - BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1)) - - ((long)addr & ~(sizeof(long) - 1))); - if (kernel_text_address((unsigned long)addr)) { - struct page *pages[2] = { virt_to_page(addr), - virt_to_page(addr + PAGE_SIZE) }; - if (!pages[1]) - nr_pages = 1; - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); - 
BUG_ON(!vaddr); - local_irq_save(flags); - memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); - local_irq_restore(flags); - vunmap(vaddr); + BUG_ON(len & (len-1)); + BUG_ON(bits & (len-1)); + + if (core_kernel_text(bits)) { + unsigned long phys = __pa(addr); + unsigned long offset = phys & ~PAGE_MASK; + unsigned long virt = fix_to_virt(FIX_POKE); + phys &= PAGE_MASK; + + spin_lock_irqsave(&poke_lock, flags); + set_fixmap(FIX_POKE, phys); + memcpy((void *)(virt + offset), opcode, len); + spin_unlock_irqrestore(&poke_lock, flags); } else { /* * modules are in vmalloc'ed memory, always writable. diff --git a/include/asm-x86/fixmap_32.h b/include/asm-x86/fixmap_32.h index eb16651..1f6df95 100644 --- a/include/asm-x86/fixmap_32.h +++ b/include/asm-x86/fixmap_32.h @@ -55,6 +55,7 @@ enum fixed_addresses { FIX_HOLE, FIX_VDSO, FIX_DBGP_BASE, + FIX_POKE, FIX_EARLYCON_MEM_BASE, #ifdef CONFIG_X86_LOCAL_APIC FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */ diff --git a/include/asm-x86/fixmap_64.h b/include/asm-x86/fixmap_64.h index f3d7685..75e6004 100644 --- a/include/asm-x86/fixmap_64.h +++ b/include/asm-x86/fixmap_64.h @@ -37,6 +37,7 @@ enum fixed_addresses { VSYSCALL_FIRST_PAGE = VSYSCALL_LAST_PAGE + ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1, VSYSCALL_HPET, + FIX_POKE, FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, FIX_HPET_BASE, ^ permalink raw reply related [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:06 ` Linus Torvalds @ 2008-04-25 16:19 ` Andi Kleen 2008-04-25 16:24 ` Linus Torvalds 2008-04-25 16:30 ` Mathieu Desnoyers 2008-04-25 16:22 ` Ingo Molnar 1 sibling, 2 replies; 183+ messages in thread From: Andi Kleen @ 2008-04-25 16:19 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, Apr 25, 2008 at 09:06:37AM -0700, Linus Torvalds wrote: > > > On Fri, 25 Apr 2008, Andi Kleen wrote: > > > > So all these checks can be just removed. > > Quite frankly, I'd rather tighten them up. All the callers actually seem > to do just a single-byte one. I think Mathieu did them to prepare for his immediate values which need to write more bytes (although actually it would be quite possible to have immediate values only for byte immediates too) But that code needs much more infrastructure anyways. > > So I'd suggest really tightening it up to require total natural alignment For the common (everything but kprobes) "other code not running" it doesn't matter and I don't think natural alignment works for the other cases anyways. FWIW the original text_poke I started long ago only did bytes > (rather than the weaker version that required that it fit in an aligned > unsigned long or whatever). And I'd suggest using FIXMAP's instead of > vmap. Maybe something like the appended (TOTALLY UNTESTED!) Not sure how the fixmap is better. It's pretty much equivalent, isn't it? Perhaps a little cheaper, but the code shouldn't be performance critical. -Andi ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:19 ` Andi Kleen @ 2008-04-25 16:24 ` Linus Torvalds 2008-04-25 16:33 ` Ingo Molnar 2008-04-25 18:13 ` Jeremy Fitzhardinge 2008-04-25 16:30 ` Mathieu Desnoyers 1 sibling, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 16:24 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Andi Kleen wrote: > > Not sure how the fixmap is better. It's pretty much equivalent, isn't it? > Perhaps a little cheaper, but the code shouldn't be performance critical. I have no really strong opinions. However, we do have a *lot* of lock prefixes in the kernel, and fixmaps are a lot cheaper than vmap(). It may not be performance-critical, but for me the "locks" section for the kernel is 0x8060 bytes long, which would seem to say that this is called four thousand times for each suspend and resume. With each invocation being thousands of instructions and a cross-CPU IPI for the tlb flush, that kind of stuff adds up. We're likely talking real fractions of a second, rather than milliseconds. But no, I didn't time it or really think very deeply about it. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
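The "four thousand" figure can be reproduced with simple arithmetic, assuming each entry in the locks section is one 8-byte pointer. That entry size is an assumption here (it matches a 64-bit build but is not stated in the mail), and lock_site_count is a hypothetical helper written only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of fixed-size entries in a section of section_bytes bytes.
 * Returns 0 for a zero entry size rather than dividing by zero.
 */
static size_t lock_site_count(size_t section_bytes, size_t entry_bytes)
{
    return entry_bytes ? section_bytes / entry_bytes : 0;
}
```

With a section size of 0x8060 (32864) bytes and 8-byte entries this gives 4108 patch sites, each of which would be poked once per switch between the SMP and UP variants.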
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:24 ` Linus Torvalds @ 2008-04-25 16:33 ` Ingo Molnar 2008-04-25 18:13 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 16:33 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Not sure how the fixmap is better. It's pretty much equivalent, > > isn't it? Perhaps a little cheaper, but the code shouldn't be > > performance critical. > > I have no really strong opinions. However, we do have a *lot* of lock > prefixes in the kernel, and fixmaps are a lot cheaper than vmap(). It > may not be performance-critical, but for me the "locks" section for > the kernel is 0x8060 bytes long, which would seem to say that this is > called four thousand times for each suspend and resume. > > With each invocation being thousands of instructions and a cross-CPU > IPI for the tlb flush, that kind of stuff adds up. We're likely > talking real fractions of a second, rather than milliseconds. the other thing is atomicity - your new version of text_poke() is evidently atomic - while vmap() does a kmalloc which might sleep. Atomicity for something as fragile as code-patching never hurts, so i definitely like your version more. it's also the more familiar API - set_fixmap() is used more frequently than vmap() - hence less danger of doing something wrong. in fact i'd do the extra sanity check below as well on top of your patch - all core kernel text pages are PageReserved so the one below would have caught the memory corruption right at its source. 
Ingo ----------------> Subject: x86: harden kernel code patching From: Ingo Molnar <mingo@elte.hu> Date: Fri Apr 25 17:07:03 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/alternative.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux/arch/x86/kernel/alternative.c =================================================================== --- linux.orig/arch/x86/kernel/alternative.c +++ linux/arch/x86/kernel/alternative.c @@ -522,6 +522,8 @@ void *__kprobes text_poke(void *addr, co unsigned long virt = fix_to_virt(FIX_POKE); phys &= PAGE_MASK; + WARN_ON(!PageReserved(virt_to_page(addr))); + spin_lock_irqsave(&poke_lock, flags); set_fixmap(FIX_POKE, phys); memcpy((void *)(virt + offset), opcode, len); ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:24 ` Linus Torvalds 2008-04-25 16:33 ` Ingo Molnar @ 2008-04-25 18:13 ` Jeremy Fitzhardinge 2008-05-05 2:36 ` Nick Piggin 1 sibling, 1 reply; 183+ messages in thread From: Jeremy Fitzhardinge @ 2008-04-25 18:13 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Nick Piggin Linus Torvalds wrote: > With each invocation being thousands of instructions and a cross-CPU IPI > for the tlb flush, that kind of stuff adds up. We're likely talking real > fractions of a second, rather than milliseconds. Doesn't vunmap batch the cross-CPU tlb flushes to amortize the cost? Hm, no, it doesn't seem to. Oh, right, it was one of Nick's TBDs. J ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 18:13 ` Jeremy Fitzhardinge @ 2008-05-05 2:36 ` Nick Piggin 0 siblings, 0 replies; 183+ messages in thread From: Nick Piggin @ 2008-05-05 2:36 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Linus Torvalds, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin On Saturday 26 April 2008 04:13, Jeremy Fitzhardinge wrote: > Linus Torvalds wrote: > > With each invocation being thousands of instructions and a cross-CPU IPI > > for the tlb flush, that kind of stuff adds up. We're likely talking real > > fractions of a second, rather than milliseconds. > > Doesn't vunmap batch the cross-CPU tlb flushes to amortize the cost? > Hm, no, it doesn't seem to. Oh, right, it was one of Nick's TBDs. Yeah and I do have patches... posted a few months ago IIRC. I forget offhand the exact details of the batching, but yes it should batch this case of Linus's. I think I set it to batch 1024 vunmaps or xxxxKB per IPI flush. With those patches it basically removed IPI flushing completely from profiles of vmap intensive workloads. The problem with vmap (outside this particular issue of text poking, which should be single-threaded anyway), really is that it is a single threaded allocator. The locking actually ends up hurting much more than the IPIs for non-trivial uses of vmap (like xfs with directories larger than PAGE_SIZE). Fortunately I have patches for that too ;) ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:19 ` Andi Kleen 2008-04-25 16:24 ` Linus Torvalds @ 2008-04-25 16:30 ` Mathieu Desnoyers 2008-04-25 16:42 ` H. Peter Anvin 1 sibling, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 16:30 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Andi Kleen (andi@firstfloor.org) wrote: > On Fri, Apr 25, 2008 at 09:06:37AM -0700, Linus Torvalds wrote: > > > > > > On Fri, 25 Apr 2008, Andi Kleen wrote: > > > > > > So all these checks can be just removed. > > > > Quite frankly, I'd rather tighten them up. All the callers actually seem > > to do just a single-byte one. > > I think Mathieu did them to prepare for his immediate values which > need to write more bytes (although actually it would be quite > possible to have immediate values only for byte immediates too) > > But that code needs much more infrastructure anyways. > Yes, the immediate values, in general, only need to do atomic writes, because I have taken care of placing the mov instruction in the correct alignment so its immediate value happens to be aligned in memory. However, the latest optimisation I did changes a conditional branch into a jump when the correct code pattern is detected : mov, test, bne short into a nop2, nop2, nop1, jmp short or mov, test, bne near into a nop2, nop2, nop1, jmp near. "replace_instruction_safe" is used for that. It puts a breakpoint in lieu of each instruction's first byte before changing the rest of the (potentially non aligned) instruction non atomically, and only then, after issuing a sync_core on every CPU to flush the trace cache, does it put back the first byte, so it's done safely wrt Intel's errata regarding code modification on SMP. 
Also note that it changes a 6 bytes branch instruction into a 1 byte nop + 5 byte jump in the near jump case, which is ok : you can split an instruction in multiple smaller instructions safely on a live system wrt any execution context, but the opposite is _not_ ok, since there could be a return address pointing in the middle of the grouped instructions sitting on some other kernel thread or interrupt stack (we should also take into account hypervisor interaction here). Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
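The breakpoint-based update described in this mail can be modelled on a plain byte buffer. What follows is a hypothetical user-space sketch, not the replace_instruction_safe code itself: the names replace_insn_sketch and jne_near_to_jmp_near are invented for this example, the cross-CPU sync_core step is only marked by a comment, and the breakpoint handler that a real implementation would need is not modelled. The opcodes used (0xCC for int3, 0x0F 0x85 for a near jne, 0xE9 for a near jmp, 0x90 for nop) are the standard x86 encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INT3_OPCODE 0xCC  /* x86 one-byte breakpoint instruction */

/*
 * Three-step update of len bytes at text with the bytes in new_code.
 * Returns 0 on success, -1 on bad arguments.
 */
static int replace_insn_sketch(unsigned char *text,
                               const unsigned char *new_code, size_t len)
{
    if (!text || !new_code || len == 0)
        return -1;

    /* Step 1: put a breakpoint on the first byte (a single-byte write). */
    text[0] = INT3_OPCODE;

    /* Step 2: rewrite the remaining bytes; this part need not be atomic. */
    if (len > 1)
        memcpy(text + 1, new_code + 1, len - 1);

    /* The mail issues a sync_core on every CPU at this point (not modelled). */

    /* Step 3: replace the breakpoint with the new first byte. */
    text[0] = new_code[0];
    return 0;
}

/*
 * Build the 6-byte "nop; jmp rel32" replacement for a 6-byte "jne rel32".
 * The displacement is copied unchanged because both the old and the new
 * branch end at the same address. Returns -1 if jne is not a near jne.
 */
static int jne_near_to_jmp_near(const unsigned char jne[6],
                                unsigned char out[6])
{
    if (jne[0] != 0x0F || jne[1] != 0x85)
        return -1;
    out[0] = 0x90;                /* 1-byte nop */
    out[1] = 0xE9;                /* jmp rel32 */
    memcpy(out + 2, jne + 2, 4);  /* same rel32 displacement */
    return 0;
}
```

The second helper shows the safe direction mentioned above, splitting one 6-byte instruction into a 1-byte nop plus a 5-byte jump; the sketch deliberately offers no helper for the opposite direction, merging instructions.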
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:30 ` Mathieu Desnoyers @ 2008-04-25 16:42 ` H. Peter Anvin 2008-04-25 17:09 ` Mathieu Desnoyers 0 siblings, 1 reply; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 16:42 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > > Yes, the immediate values, in general, only need to do atomic writes, > because I have taken care of placing the mov instruction in the correct > alignment so its immediate value happens to be aligned in memory. > However, the latest optimisation I did to change a conditional branch > into a jump when the correct code pattern is detected : > > mov, test, bne short > into a > nop2, nop2, nop1, jmp short > > or > > mov, test, bne near > into a > nop2, nop2, nop1, jmp near > And how, pray tell, do you deal with the fact that: a) the EFLAGS may be live on exit; b) there might be a jump into the middle of this instruction sequence? -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:42 ` H. Peter Anvin @ 2008-04-25 17:09 ` Mathieu Desnoyers 2008-04-25 18:37 ` Mathieu Desnoyers 0 siblings, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 17:09 UTC (permalink / raw) To: H. Peter Anvin Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * H. Peter Anvin (hpa@zytor.com) wrote: > Mathieu Desnoyers wrote: >> Yes, the immediate values, in general, only need to do atomic writes, >> because I have taken care of placing the mov instruction in the correct >> alignment so its immediate value happens to be aligned in memory. >> However, the latest optimisation I did to change a conditional branch >> into a jump when the correct code pattern is detected : >> mov, test, bne short >> into a >> nop2, nop2, nop1, jmp short >> or >> mov, test, bne near >> into a >> nop2, nop2, nop1, jmp near > > And how, pray tell, do you deal with the fact that: > > a) the EFLAGS may be live on exit; Actually, not only EFLAGS can be live on exit, but also the immediate value itself. If we take the mov, test, jne short case into account, I force the mov to populate the %al register with some immediate value. Then, this value is extracted from the inline assembly and fed to an if() C statement in the form of a variable. So, I check precisely for a mov %al,0, followed by test and bne. If I don't find it (due to gcc optimizations), then I leave the original immediate value there. I start the pattern matching from the address of the movb instruction, which I extract from the inline assembly. So, about the EFLAGS : given that I first change the jne for an unconditional jump, I just don't care about the status of the ZF : jump does not change the EFLAGS, and it does not depend on any. 
However, it is still valid to leave the mov and test instructions there, because ZF is considered "live" by gcc across the test+jne instructions. Then, I patch mov and test in any order, because we just don't care about the status of the ZF, or do we... ? The only limitation is that a given imv_cond(var) should only be used in the following pattern : if (imv_cond(var)) ... Trying to save the result of imv_cond(var) and use it in multiple if() statements would cause the compiler to duplicate tests and branches on that variable and the pattern matching would not see that. I think it's what you fear. Now that you speak of it, it might be better to leave the movb and test instruction there to make sure we don't kill the ZF which might be needed by some other code. > b) there might be a jump into the middle of this instruction sequence? > If we change that, as discussed above, so the liveliness of ZF and of the %al register is still insured by leaving the mov and test instructions in place, we end up only modifying a single instruction and the problem fades away. We would end up changing a jne for a jmp. Thanks, Mathieu > -hpa -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 17:09 ` Mathieu Desnoyers @ 2008-04-25 18:37 ` Mathieu Desnoyers 2008-04-25 18:47 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 18:37 UTC (permalink / raw) To: H. Peter Anvin Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote: > * H. Peter Anvin (hpa@zytor.com) wrote: > > Mathieu Desnoyers wrote: > >> Yes, the immediate values, in general, only need to do atomic writes, > >> because I have taken care of placing the mov instruction in the correct > >> alignment so its immediate value happens to be aligned in memory. > >> However, the latest optimisation I did to change a conditional branch > >> into a jump when the correct code pattern is detected : > >> mov, test, bne short > >> into a > >> nop2, nop2, nop1, jmp short > >> or > >> mov, test, bne near > >> into a > >> nop2, nop2, nop1, jmp near > > > > And how, pray tell, do you deal with the fact that: > > > > a) the EFLAGS may be live on exit; > > Actually, not only EFLAGS can be live on exit, but also the immediate > value itself. > > If we take the mov, test, jne short case into account, I force the mov > to populate the %al register with some immediate value. Then, this value > is extracted from the inline assembly and feeded to an if() c statement > under the form of a variable. So, I check precisely for a mov %al,0, > followed by test and bne. If I don't find it (due to gcc optimizations), > then I leave the original immediate value there. I start the pattern > matching from the address of the movb instruction, which I extract from > the inline assembly. 
So, about the EFLAGS : given that I first change > the jne for an unconditional jump, I just don't care about the status of > the ZF : jump does not change the EFLAGS, and it does not depend on any. > However, it is still valid to leave the mov and test instructions there, > because ZF is considered "live" by gcc across the test+jne instructions. > > Then, I patch mov and test in any order, because we just don't care > about the status of the ZF, or do we... ? The only limitation is that a > given imv_cond(var) should only be used in the following pattern : > > if (imv_cond(var)) ... > > Trying to save the result of imv_cond(var) and use it in multiple if() > statements would cause the compiler to duplicate tests and branches on > that variable and the pattern matching would not see that. I think it's > what you fear. Now that you speak of it, it might be better to leave the > movb and test instruction there to make sure we don't kill the ZF which > might be needed by some other code. > Thinking about it, there could be a way to ensure limited ZF and %al liveliness: adding an epilogue to the expected instruction sequence formed by an asm statement which clobbers the flags (flags are clobbered in any asm statement on x86) and clobbers %al. From that point, we just have to find a specific signature that gcc could not imitate to put in this asm statement, so we can detect if other instructions have been placed in the middle of our sequence by gcc. Actually, I think the best thing to do with this asm statement is to put the instruction pointer in a special section, so we know that this code location marks the end of ZF and %al liveliness. There would be therefore no added code, just asm constraints. This epilogue should then be used on both branches of the condition, like this : if (unlikely(imv_cond(var))) { imv_cond_end(); ... } else { imv_cond_end(); ... 
} Where imv_cond_end() would look like this : +/* + * Puts a test and branch make sure the %al register and ZF are not live + * anymore. + * All asm statements clobbers the flags, but add "cc" clobber just to be sure. + * Clobbers %al. + */ +#define imv_cond_end() \ + do { \ + asm (".section __imv_cond_end,\"a\",@progbits\n\t" \ + _ASM_PTR "1f\n\t" \ + ".previous\n\t" \ + "1:\n\t" \ + : : : "a", "cc"); \ + } while (0) + The pattern to test for will therefore become : mov, test, branch, address following branch should be in the __imv_cond_end table. The address of the branch target site would also have to be in the __imv_cond_end table. > > b) there might be a jump into the middle of this instruction sequence? > > > > If we change that, as discussed above, so the liveliness of ZF and of > the %al register is still insured by leaving the mov and test > instructions in place, we end up only modifying a single instruction and > the problem fades away. We would end up changing a jne for a jmp. > So, if we do is I propose here, we have to take into account this question too. Any jump that jumps in the middle of this instruction sequence would have to insure correct liveliness of %al and ZF. However, since we just limited the scope of their liveliness, there are no other code paths which can jump in the middle of our instruction sequence and insure correct ZF and %al liveliness. Does it make sense ? Thanks, Mathieu > Thanks, > > Mathieu > > > -hpa > > -- > Mathieu Desnoyers > Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 18:37 ` Mathieu Desnoyers @ 2008-04-25 18:47 ` H. Peter Anvin 2008-04-25 19:19 ` H. Peter Anvin 2008-04-25 20:18 ` H. Peter Anvin 2 siblings, 0 replies; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 18:47 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > > Thinking about it, there could be a way to insure limited ZF and %al > liveliness: adding an epilogue to the expected instruction sequence > formed by an asm statement which clobbers the flags (flags are clobbered > in any asm statement on x86) and clobbers %al. > > From that point, we just have to find a specific signature that gcc > could not imitate to put in this asm statement, so we can detect if > other instructions have been placed in the middle of our sequence by > gcc. Actually, I think the best thing to do with this asm statement is > to put the instruction pointer in a special section, so we know that > this code location marks the end of ZF and %al liveliness. There would > be therefore no added code, just asm constraints. > > This epilogue should then be used on both branches of the condition, > like this : > > if (unlikely(imv_cond(var))) { > imv_cond_end(); > ... > } else { > imv_cond_end(); > ... > } > [...] > > Does it make sense ? > I don't think so. You're making way too many assumptions about the code generated by gcc. This kind of stuff absolutely can be done, *BUT* it requires the cooperation of the compiler. The right way to do this is to negotiate a set of appropriate builtins with the gcc people, and use them. This means this optimization will only work when compiled with the new gcc, so there is a substantial lag, but it's the only sane way to do this kind of stuff. 
-hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 18:37 ` Mathieu Desnoyers 2008-04-25 18:47 ` H. Peter Anvin @ 2008-04-25 19:19 ` H. Peter Anvin 2008-04-25 20:04 ` Mathieu Desnoyers 2008-04-25 20:18 ` H. Peter Anvin 2 siblings, 1 reply; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 19:19 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > >>> b) there might be a jump into the middle of this instruction sequence? >>> >> If we change that, as discussed above, so the liveliness of ZF and of >> the %al register is still insured by leaving the mov and test >> instructions in place, we end up only modifying a single instruction and >> the problem fades away. We would end up changing a jne for a jmp. > > So, if we do is I propose here, we have to take into account this > question too. Any jump that jumps in the middle of this instruction > sequence would have to insure correct liveliness of %al and ZF. However, > since we just limited the scope of their liveliness, there are no other > code paths which can jump in the middle of our instruction sequence and > insure correct ZF and %al liveliness. > I wanted to point out that this, in particular, is utter nonsense. Consider a sequence that looks something like this: if (foo ? bar : imv_cond(var)) { blah(); } An entirely sane transformation of this (as far as gcc is concerned), is something like: cmpl $0,foo je 1f cmpl $0,bar jmp 2f 1: #APP movb var,%al /* This is your imv */ #NO_APP testb %al,%al 2: je 3f call blah 3: Your code would take the movb-testb-je sequence and combine them, then we jump into the middle of the new instruction when jumping at 2! There are only two ways to deal with this - extensive analysis of the entire flow of control, or telling the compiler exactly what is *actually* going on. 
The latter is the preferred way, obviously. -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 19:19 ` H. Peter Anvin @ 2008-04-25 20:04 ` Mathieu Desnoyers 2008-04-25 20:09 ` H. Peter Anvin 0 siblings, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 20:04 UTC (permalink / raw) To: H. Peter Anvin Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * H. Peter Anvin (hpa@zytor.com) wrote: > Mathieu Desnoyers wrote: >>>> b) there might be a jump into the middle of this instruction sequence? >>>> >>> If we change that, as discussed above, so the liveliness of ZF and of >>> the %al register is still insured by leaving the mov and test >>> instructions in place, we end up only modifying a single instruction and >>> the problem fades away. We would end up changing a jne for a jmp. >> So, if we do is I propose here, we have to take into account this >> question too. Any jump that jumps in the middle of this instruction >> sequence would have to insure correct liveliness of %al and ZF. However, >> since we just limited the scope of their liveliness, there are no other >> code paths which can jump in the middle of our instruction sequence and >> insure correct ZF and %al liveliness. > > I wanted to point out that this, in particular, is utter nonsense. Consider > a sequence that looks something like this: > > if (foo ? bar : imv_cond(var)) { > blah(); > } > > An entirely sane transformation of this (as far as gcc is concerned), is > something like: > > cmpl $0,foo > je 1f > cmpl $0,bar > jmp 2f > 1: > #APP > movb var,%al /* This is your imv */ > #NO_APP > testb %al,%al > 2: > je 3f > call blah > 3: > > Your code would take the movb-testb-je sequence and combine them, then we > jump into the middle of the new instruction when jumping at 2! > I am glad you come up with a counter argument. 
Let's look at what would happen here with my modified code : cmpl $0,foo je 1f cmpl $0,bar jmp 2f 1: #APP mov %esi, %esi /* nop 2 bytes */ #NO_APP mov %esi, %esi /* nop 2 bytes */ 2: jmp 3f /* 2 bytes short jump */ call blah 3: First of all, I do not "combine" the instructions.. that would be really dangerous (and bug-prone, since any interrupt could iret to an invalid instruction). No, all I do is to swap instructions for other instructions of the same size (or smaller in the case of jne 6 bytes -> nop1 + jmp 5 bytes). I see the problem you show here : it's dangerous to change an instruction generated by gcc because it can be re-used for other purposes, as in your example. Then, what I propose is the following : instead of modifying the conditional branch instruction, I prefix my inline assembly with a 5 bytes jump. I can then have the exact same behavior as the original conditional branch; I either jump at the address following the conditional branch or at the target address. I would still have to check for ZF and %al liveliness as I proposed earlier, because I would skip the movb and test instructions. > There are only two ways to deal with this - extensive analysis of the > entire flow of control, or telling the compiler exactly what is *actually* > going on. The latter is the preferred way, obviously. > Yes, in an ideal world, gcc would help here. Mathieu > -hpa > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 20:04 ` Mathieu Desnoyers @ 2008-04-25 20:09 ` H. Peter Anvin 0 siblings, 0 replies; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 20:09 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > >> There are only two ways to deal with this - extensive analysis of the >> entire flow of control, or telling the compiler exactly what is *actually* >> going on. The latter is the preferred way, obviously. >> > > Yes, in an ideal world, gcc would help here. > gcc is a free software project, and the gcc maintainers are around and can be approached. A good proposal will go a long way, and patches will go even longer. -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 18:37 ` Mathieu Desnoyers 2008-04-25 18:47 ` H. Peter Anvin 2008-04-25 19:19 ` H. Peter Anvin @ 2008-04-25 20:18 ` H. Peter Anvin 2008-04-25 20:37 ` Mathieu Desnoyers 2 siblings, 1 reply; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 20:18 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > > This epilogue should then be used on both branches of the condition, > like this : > > if (unlikely(imv_cond(var))) { > imv_cond_end(); > ... > } else { > imv_cond_end(); > ... > } > > Where imv_cond_end() would look like this : > > +/* > + * Puts a test and branch make sure the %al register and ZF are not live > + * anymore. > + * All asm statements clobbers the flags, but add "cc" clobber just to be sure. > + * Clobbers %al. > + */ > +#define imv_cond_end() \ > + do { \ > + asm (".section __imv_cond_end,\"a\",@progbits\n\t" \ > + _ASM_PTR "1f\n\t" \ > + ".previous\n\t" \ > + "1:\n\t" \ > + : : : "a", "cc"); \ > + } while (0) > + > As far as this is concerned, all you accomplish here is that gcc, if it wants to re-use the %al value, will copy it into another register before doing your imv_conv_end(). -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 20:18 ` H. Peter Anvin @ 2008-04-25 20:37 ` Mathieu Desnoyers 2008-04-25 20:41 ` H. Peter Anvin 0 siblings, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 20:37 UTC (permalink / raw) To: H. Peter Anvin Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * H. Peter Anvin (hpa@zytor.com) wrote: > Mathieu Desnoyers wrote: >> This epilogue should then be used on both branches of the condition, >> like this : >> if (unlikely(imv_cond(var))) { >> imv_cond_end(); >> ... >> } else { >> imv_cond_end(); >> ... >> } >> Where imv_cond_end() would look like this : >> +/* >> + * Puts a test and branch make sure the %al register and ZF are not live >> + * anymore. >> + * All asm statements clobbers the flags, but add "cc" clobber just to be >> sure. >> + * Clobbers %al. >> + */ >> +#define imv_cond_end() \ >> + do { \ >> + asm (".section __imv_cond_end,\"a\",@progbits\n\t" \ >> + _ASM_PTR "1f\n\t" \ >> + ".previous\n\t" \ >> + "1:\n\t" \ >> + : : : "a", "cc"); \ >> + } while (0) >> + > > As far as this is concerned, all you accomplish here is that gcc, if it > wants to re-use the %al value, will copy it into another register before > doing your imv_conv_end(). > Exactly, and by doing so, it will have to add instructions (mov, push..) in the instruction pattern I am looking for and therefore I will detect this and fall back on standard immediate values. Mathieu > -hpa -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 20:37 ` Mathieu Desnoyers @ 2008-04-25 20:41 ` H. Peter Anvin 2008-04-25 20:51 ` Linus Torvalds 2008-04-25 21:02 ` David Miller 0 siblings, 2 replies; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 20:41 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: >>> + >> As far as this is concerned, all you accomplish here is that gcc, if it >> wants to re-use the %al value, will copy it into another register before >> doing your imv_conv_end(). >> > > Exactly, and by doing so, it will have to add instructions (mov, push..) > in the instruction pattern I am looking for and therefore I will detect > this and fall back on standard immediate values. > So what you're saying is you'll follow all the branches of code until you detect an immediate value (and eflags) kill. Yes, that should work. It's still ugly, and I have to say I find the complexity rather distasteful. I am willing to be convinced it's worth it, but I would really like to see hard numbers. Personally, I wouldn't be all that surprised if you lost more in constraining gcc scheduling than you gain. -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 20:41 ` H. Peter Anvin @ 2008-04-25 20:51 ` Linus Torvalds 2008-04-25 21:12 ` Mathieu Desnoyers 2008-04-25 21:02 ` David Miller 1 sibling, 1 reply; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 20:51 UTC (permalink / raw) To: H. Peter Anvin Cc: Mathieu Desnoyers, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge On Fri, 25 Apr 2008, H. Peter Anvin wrote: > > Yes, that should work. It's still ugly, and I have to say I find the > complexity rather distasteful. I am willing to be convinced it's worth it, > but I would really like to see hard numbers. I really cannot imagine that this kind of pain is *ever* worth it. Please give an example of something so important that we'd want to do complex code rewriting on the fly. What _is_ the point of imv_cond()? Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 20:51 ` Linus Torvalds @ 2008-04-25 21:12 ` Mathieu Desnoyers 2008-04-25 21:15 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 21:12 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Fri, 25 Apr 2008, H. Peter Anvin wrote: > > > > Yes, that should work. It's still ugly, and I have to say I find the > > complexity rather distasteful. I am willing to be convinced it's worth it, > > but I would really like to see hard numbers. > > I really cannot imagine that this kind of pain is *ever* worth it. > > Please give an example of something so important that we'd want to do > complex code rewriting on the fly. What _is_ the point of imv_cond()? > > Linus The point is to provide a way to dynamically enable code at runtime without noticeable performance impact on the system. It's principally useful to control the markers in the kernel, which can be placed in very frequently executed code paths. The original markers add a memory read, test and conditional branch at each marker site. By using the immediate values patchset, it goes down to a load immediate value, test and branch. However, Ingo was still unhappy with the conditional branch, so I cooked this jump patching optimization on top of the immediate values. It looks for an expected pattern which limits the liveness of the %al register and ZF to the 3 instructions and, if it finds it, patches a jump located just before the mov instruction to skip the whole pattern and behave exactly like the conditional branch. So basically we get code dynamically activated by patching a single jump. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. 
Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
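To make the cost under discussion concrete, here is a minimal user-space sketch of the original marker shape described above: a memory read, a test and a conditional branch guarding an out-of-line call. All names (marker_enabled, probe_sched_switch, context_switch) are hypothetical and are not the kernel's marker API; unlikely() is defined locally on top of GCC's __builtin_expect, so this assumes GCC or Clang.

```c
/* Hypothetical names throughout; this is not the kernel marker API. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Flag read from memory at every marker site (the "memory read"). */
static volatile int marker_enabled;

/* Counts probe invocations so the effect is observable. */
static int probe_hits;

/* Out-of-line probe: arguments are only set up when the branch is taken. */
static void __attribute__((noinline)) probe_sched_switch(int prev_pid, int next_pid)
{
	(void)prev_pid;
	(void)next_pid;
	probe_hits++;
}

/* Instrumented function: load the flag, test it, branch conditionally. */
static int context_switch(int prev_pid, int next_pid)
{
	if (unlikely(marker_enabled))
		probe_sched_switch(prev_pid, next_pid);
	return next_pid;
}
```

With marker_enabled left at 0, context_switch(1, 2) returns 2 and never calls the probe; setting the flag to 1 makes the same call increment probe_hits. The immediate-value and jump-patching variants discussed in the thread replace the flag load, they are not shown here.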
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 21:12 ` Mathieu Desnoyers @ 2008-04-25 21:15 ` H. Peter Anvin 2008-04-25 21:47 ` Mathieu Desnoyers 2008-04-25 22:04 ` Linus Torvalds 2008-04-26 6:50 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 21:15 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > > The point is to provide a way to dynamically enable code at runtime > without noticeable performance impact on the system. It's principally > useful to control the markers in the kernel, which can be placed in very > frequently executed code paths. The original markers add a memory read, > test and conditional branch at each marker site. By using the immediate > values patchset, it goes down to a load immediate value, test and branch. > > However, Ingo was still unhappy with the conditional branch, so I cooked > this jump patching optimization on top of the immediate values. It > looks for an expected pattern which limits the liveliness of the %al and > ZF registers to the 3 instructions and, if it finds it, patches a jump > located just before the mov instruction to skip the whole pattern and > behave exactly like the conditional branch. > > So basically we get code dynamically actvated by patching a single jump. > Note that all these optimizations only make sense if the case where we *take* the "marker" is frequent, *and* the marker itself is not too expensive. If that is not the case, just put in a noop that is dynamically patched to an INT3 or ICEBP instruction (one byte) or an INT instruction (two bytes), take the exception, look up the source address and revector to the marker code. -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
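The "take the exception, look up the source address and revector" step can be sketched in user space as a table lookup. This is only an illustration under stated assumptions: the table layout, the addresses and the function names (probe_table, dispatch_trap) are made up, and no real exception is taken. The one architectural fact relied on is that after an INT3 the saved instruction pointer points just past the one-byte opcode.

```c
#include <stdint.h>
#include <stdlib.h>

typedef void (*probe_fn)(void);

/* Hypothetical table entry: address of a patched site and its handler. */
struct probe_entry {
	uintptr_t site;
	probe_fn handler;
};

static int hits_a, hits_b;
static void probe_a(void) { hits_a++; }
static void probe_b(void) { hits_b++; }

/* Must stay sorted by site address so bsearch() can be used. */
static const struct probe_entry probe_table[] = {
	{ 0x1000, probe_a },
	{ 0x2040, probe_b },
};

static int cmp_site(const void *key, const void *elem)
{
	uintptr_t k = *(const uintptr_t *)key;
	const struct probe_entry *e = elem;

	return (k > e->site) - (k < e->site);
}

/*
 * Simulated trap handler: trap_ip is the saved instruction pointer.
 * Returns 1 if a probe was found and run, 0 otherwise.
 */
static int dispatch_trap(uintptr_t trap_ip)
{
	uintptr_t site = trap_ip - 1;	/* INT3 is one byte long */
	const struct probe_entry *e;

	e = bsearch(&site, probe_table,
		    sizeof(probe_table) / sizeof(probe_table[0]),
		    sizeof(probe_table[0]), cmp_site);
	if (!e)
		return 0;
	e->handler();
	return 1;
}
```

A trap reported at 0x1001 resolves to the site at 0x1000 and runs probe_a; an address with no entry is reported as unhandled.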
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 21:15 ` H. Peter Anvin @ 2008-04-25 21:47 ` Mathieu Desnoyers 2008-04-25 22:07 ` H. Peter Anvin 0 siblings, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 21:47 UTC (permalink / raw) To: H. Peter Anvin Cc: Linus Torvalds, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * H. Peter Anvin (hpa@zytor.com) wrote: > Mathieu Desnoyers wrote: >> The point is to provide a way to dynamically enable code at runtime >> without noticeable performance impact on the system. It's principally >> useful to control the markers in the kernel, which can be placed in very >> frequently executed code paths. The original markers add a memory read, >> test and conditional branch at each marker site. By using the immediate >> values patchset, it goes down to a load immediate value, test and branch. >> However, Ingo was still unhappy with the conditional branch, so I cooked >> this jump patching optimization on top of the immediate values. It >> looks for an expected pattern which limits the liveliness of the %al and >> ZF registers to the 3 instructions and, if it finds it, patches a jump >> located just before the mov instruction to skip the whole pattern and >> behave exactly like the conditional branch. >> So basically we get code dynamically actvated by patching a single jump. > > Note that all these optimizations only make sense if the case where we > *take* the "marker" is frequent, *and* the marker itself is not too > expensive. > Yes, this is the case. Using breakpoints for markers quickly becomes noticeable for things such as scheduler instrumentation, page fault handler instrumentation, etc. And yes, I have developed a kernel tracer, LTTng, which takes care of writing the data to trace buffers efficiently. 
The last time I took performance measurements, it was performing locking and writing to the memory buffer in about 270ns on a 3GHz Pentium 4. It might be a tiny bit slower now that it parses the markers format strings dynamically, but nothing very significant. But there is another point that markers do which the breakpoint won't give you : they extract local variables from functions and they identify them with field names which separates the instrumentation from the actual kernel implementation details. In order to do that, I rely on gcc building a stack frame for a function call, which I don't want to build unnecessarily when the marker is disabled. This is why I use a jump to skip passing the arguments on the stack and the function call. Mathieu > If that is not the case, just put in a noop that is dynamically patched to > an INT3 or ICEBP instruction (one byte) or an INT instruction (two bytes), > take the exception, look up the source address and revector to the marker > code. > > -hpa -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 21:47 ` Mathieu Desnoyers @ 2008-04-25 22:07 ` H. Peter Anvin 2008-04-25 22:30 ` Mathieu Desnoyers 0 siblings, 1 reply; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 22:07 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > > Yes, this is the case. Using breakpoints for markers quickly becomes > noticeable for thing such as scheduler instrumentation, page fault > handler instrumentation, etc. And yes, I have developed kernel tracer, > LTTng, which takes care of writing the data to trace buffers > efficiently. The last time I took performance measurements, it was > performing locking and writing to the memory buffer in about 270ns on a > 3GHz Pentium 4. It might be a tiny bit slower now that it parses the > markers format strings dynamically, but nothing very significant. > > But there is another point that markers do which the breakpoint won't > give you : they extract local variables from functions and they identify > them with field names which separates the instrumentation from the > actual kernel implementation details. In order to do that, I rely on gcc > building a stack frame for a function call, which I don't want to build > unnecessarity when the marker is disabled. This is why I use a jump to > skip passing the arguments on the stack and the function call. > Well, debuggers do it, and that's ultimately why we have debugging annotation formats like DWARF2 - to be able to take an arbitrary state and decode local variables from the combined register-memory state. This is often done by an interpreter, but that's not necessary; a compiler can use the debugging information and build appropriate capture code, which would be able to execute very quickly. 
Not only is this capable of extracting arbitrary information, but it also guarantees that the extraction code is out of line. The act of building a stack frame not only perturbs the generated code (gcc has to guarantee liveness, which you can see as a pro or a con), but it also puts a fair amount of code in the icache path of the function. Now, if a breakpoint is too expensive, one can do exactly the same trick with a naked call instruction, with a higher icache impact in the unused case (five bytes instead of one or two). However, the key to low impact is to use the debugging information to recover state. (Liveness at the probe point is still possible to enforce with this technique: give gcc a "g" read constraint as part of the probe instruction. That makes gcc ensure the information is *somewhere*. The debugging information will tell you where to pick it up from. Obviously, any time liveness is enforced you suffer a potential cost.) -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
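The "g" read constraint mentioned above can be illustrated with GCC-style extended asm. This is a minimal sketch, assuming GCC or Clang: the macro name PROBE_KEEP_LIVE is hypothetical, and the empty asm template merely stands in for the probe instruction (a NOP or call site in the proposal). The "g" input operand tells the compiler the value must be available in a register, in memory or as an immediate at that point, without dictating which.

```c
/*
 * Hypothetical macro: an empty asm statement with a "g" input operand.
 * The compiler must keep val materialized somewhere at this point;
 * debugging information would then say where to find it.
 */
#define PROBE_KEEP_LIVE(val) __asm__ __volatile__("" : : "g"(val))

/* Sum 1..n, keeping the running total observable at each iteration. */
static int sum_to(int n)
{
	int acc = 0;
	int i;

	for (i = 1; i <= n; i++) {
		acc += i;
		PROBE_KEEP_LIVE(acc);	/* acc cannot be optimized away here */
	}
	return acc;
}
```

The function still computes the ordinary sum; the only effect of the macro is on what the optimizer is allowed to discard at the probe point, which is the potential cost noted in the message above.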
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:07 ` H. Peter Anvin @ 2008-04-25 22:30 ` Mathieu Desnoyers 2008-04-25 22:36 ` Linus Torvalds 2008-04-25 22:38 ` H. Peter Anvin 0 siblings, 2 replies; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 22:30 UTC (permalink / raw) To: H. Peter Anvin Cc: Linus Torvalds, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * H. Peter Anvin (hpa@zytor.com) wrote: > Mathieu Desnoyers wrote: >> Yes, this is the case. Using breakpoints for markers quickly becomes >> noticeable for thing such as scheduler instrumentation, page fault >> handler instrumentation, etc. And yes, I have developed kernel tracer, >> LTTng, which takes care of writing the data to trace buffers >> efficiently. The last time I took performance measurements, it was >> performing locking and writing to the memory buffer in about 270ns on a >> 3GHz Pentium 4. It might be a tiny bit slower now that it parses the >> markers format strings dynamically, but nothing very significant. >> But there is another point that markers do which the breakpoint won't >> give you : they extract local variables from functions and they identify >> them with field names which separates the instrumentation from the >> actual kernel implementation details. In order to do that, I rely on gcc >> building a stack frame for a function call, which I don't want to build >> unnecessarity when the marker is disabled. This is why I use a jump to >> skip passing the arguments on the stack and the function call. > > Well, debuggers do it, and that's ultimately what why we have debugging > annotation formats like DWARF2 - to be able to take an arbitrary state and > decode local variables from the combined register-memory state. 
This is > often done by an interpreter, but that's not necessary; a compiler can use > the debugging information and build appropriate capture code, which would > be able to execute very quickly. Not only is this capable of extracting > arbitrary information, but it also guarantees that the extraction code is > out of line. > DWARF2 is capable of extracting information only when not optimized away by the compiler. That's the whole point of markers : liveness is good in this case because we make sure the variable is there, not that it *might* be there. The latter case might be good enough for a debugger, but not for a production system tracer. > The act of building a stack frame not only preturbs the generated code (gcc > has to guarantee liveness, which you can see as a pro or a con), but it > also puts a fair amount of code in the icache path of the function. > if (unlikely(condition)) function_call(params); The builtin expect will take care to put the instructions out of the hot paths and therefore leave them out of the icache with gcc -freorder-blocks (in -O2). The only addition to the frequently used icache is, in this case, the 5 bytes jump, 2 bytes mov, 2 bytes test and 2 (or 6) bytes conditional branch, for a total of 11 bytes for small functions and 15 bytes for functions which require near jumps. > Now, if a breakpoint is too expensive, one can do exactly the same trick > with a naked call instruction, with a higher icache impact in the unused > case (five bytes instead of one or two). However, the key to low impact is > to use the debugging information to recover state. > The runtime cost of function call is bigger than the jump. I don't see what this buys us. > (Liveness at the probe point is still possible to enforce with this > technique: give gcc a "g" read constraint as part of the probe instruction. > That makes gcc ensure the information is *somewhere*. The debugging > information will tell you where to pick it up from. 
Obviously, any time > liveness is enforce you suffer a potential cost.) It could be possible to do so. However, passing a variable argument list to a marker is rather more flexible than those inline assembly constraints. And you are still tied to the variable names and offer no abstraction between the kernel implementation and the conceptual name associated to a traced variable. Mathieu > > -hpa -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:30 ` Mathieu Desnoyers @ 2008-04-25 22:36 ` Linus Torvalds 2008-04-28 20:21 ` Ingo Molnar 2008-04-28 20:43 ` Mathieu Desnoyers 2008-04-25 22:38 ` H. Peter Anvin 1 sibling, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 22:36 UTC (permalink / raw) To: Mathieu Desnoyers Cc: H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Mathieu Desnoyers wrote: > > DWARF2 is capable of extracting information only when not optimized away > by the compiler. That's the whole point of markers : liveness is good in > this case because we make sure the variable is there, not that it > *might* be there. The latter case might be good enough for a debugger, > but not for a production system tracer. This is why you really do want to recompile the function entirely if you're debugging it. Because it might simply not be debuggable in its normal state. I'd much rather see something truly generic that doesn't need any pre-inserted "markers" at all that disable optimizations, and that allows just about anything. Including live system bug-fixes etc (imagine finding a bug - and not at something that was previously already "marked" - and just replacing the buggy function with a non-buggy one). Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:36 ` Linus Torvalds @ 2008-04-28 20:21 ` Ingo Molnar 2008-04-28 20:55 ` Jeremy Fitzhardinge 2008-04-28 20:43 ` Mathieu Desnoyers 1 sibling, 1 reply; 183+ messages in thread From: Ingo Molnar @ 2008-04-28 20:21 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, H. Peter Anvin, Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > I'd much rather see something truly generic that doesn't need any > pre-inserted "markers" at all that disable optimizations, and that > allows just about anything. Including live system bug-fixes etc > (imagine finding a bug - and not at somethign that was previously > already "marked" - and just replacing the buggy function with a > non-buggy one). Ob'plug: with the pending dyn-ftrace function tracer feature we do something rather close to that already: we have a 5 byte NOP in the prologue of every function that can be used as a non-destructive 'branch away' place. Right now we use that to trace a (regex-ish pattern identified) set of functions. The regex pattern can be configured at runtime via /debug/tracing/function_filter and is not parsed at runtime in any fastpath - it is used to activate/deactivate the tracepoints and patches them from NOPs into CALLs. _But_ the same mechanism could perhaps be used to patch the function as well. The cost is +5 bytes of NOP for every function in the system, but in practice we've not been able to measure any actual runtime costs of these NOPs - neither in micro-benchmarks nor in macro-benchmarks. (the only real cost here is the +5 bytes of I$ cost - otherwise the NOP will just be skipped by the decoder.) the patching of these NOPs is inherently safe because they are inserted at build time. There's no negative impact to gcc optimizations at all. 
We get a nice selection of 75,000 tracepoints in an allmodconfig kernel - without _any_ source code level impact in the functions. On the other hand, i'm not opposed to a handful of static markers either - i think the best model is to have both of these facilities. There are a couple of 'core events' that are not expressed via function calls, and even where they are expressed via function calls the function call layout is not stable while markers are stable across kernel versions. The notion of "a context-switch happened from task X to task Y" or "task X woke up task Y" is not going to change anytime soon so i'm not opposed to exposing that kind of information. And once we accept the static markers, we might as well make them as cheap as possible. Ingo ^ permalink raw reply [flat|nested] 183+ messages in thread
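The NOP-to-CALL patching described above can be sketched on a plain byte buffer. This is a user-space simulation under stated assumptions, not the dyn-ftrace code: the function names (make_call, patch_site) and the addresses are hypothetical, and it ignores everything a live kernel would need for safe cross-CPU code modification. It relies only on two x86 encodings: the common 5-byte NOP 0f 1f 44 00 00, and CALL rel32 as e8 followed by a little-endian displacement relative to the end of the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PATCH_LEN 5

/* A common 5-byte x86 NOP encoding (nopl 0x0(%rax,%rax,1)). */
static const uint8_t nop5[PATCH_LEN] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Encode "call target" as it would appear at virtual address site. */
static void make_call(uint8_t out[PATCH_LEN], uint64_t site, uint64_t target)
{
	int64_t rel = (int64_t)(target - (site + PATCH_LEN));
	uint32_t rel32;

	assert(rel >= INT32_MIN && rel <= INT32_MAX);
	rel32 = (uint32_t)(int32_t)rel;
	out[0] = 0xe8;				/* CALL rel32 opcode */
	out[1] = (uint8_t)(rel32 & 0xff);	/* little-endian displacement */
	out[2] = (uint8_t)((rel32 >> 8) & 0xff);
	out[3] = (uint8_t)((rel32 >> 16) & 0xff);
	out[4] = (uint8_t)((rel32 >> 24) & 0xff);
}

/*
 * Toggle one site in a simulated text buffer mapped at address base.
 * Returns 0 on success, -1 if the site does not hold the expected bytes.
 */
static int patch_site(uint8_t *text, size_t off, uint64_t base,
		      uint64_t target, int enable)
{
	uint8_t call[PATCH_LEN];

	make_call(call, base + off, target);
	if (enable) {
		if (memcmp(text + off, nop5, PATCH_LEN) != 0)
			return -1;		/* not a pristine NOP site */
		memcpy(text + off, call, PATCH_LEN);
	} else {
		if (memcmp(text + off, call, PATCH_LEN) != 0)
			return -1;		/* not our CALL */
		memcpy(text + off, nop5, PATCH_LEN);
	}
	return 0;
}
```

Checking the existing bytes before writing mirrors the point made in the message that the sites are known in advance because they are inserted at build time; a site that does not contain the expected NOP is simply refused.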
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-28 20:21 ` Ingo Molnar @ 2008-04-28 20:55 ` Jeremy Fitzhardinge 2008-04-28 21:01 ` H. Peter Anvin 0 siblings, 1 reply; 183+ messages in thread From: Jeremy Fitzhardinge @ 2008-04-28 20:55 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Mathieu Desnoyers, H. Peter Anvin, Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec Ingo Molnar wrote: > And once we accept the static > markers, we might as well make them as cheap as possible. Sure, so long as you take "as cheap as possible" to mean cheap in both implementation complexity as well as runtime cost. I don't have any specific objections to any of the stuff that Mathieu is working on, but it does worry me that each time a problem is addressed it ends up being an even more subtle piece of code. I just haven't seen enough concrete justification to make me feel comfortable with it all. It seems to me that a relatively simple implementation which allows the desired tracing/marking functionality is the first step. If that proves to cause a significant performance deficit when enabled then we can work out how to address it in due course. But doing it all at once before merging anything seems like overkill, particularly when we're talking about specifics of gcc's codegen patterns, disassembling code fragments, etc. J ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-28 20:55 ` Jeremy Fitzhardinge @ 2008-04-28 21:01 ` H. Peter Anvin 2008-04-28 22:42 ` Mathieu Desnoyers 0 siblings, 1 reply; 183+ messages in thread From: H. Peter Anvin @ 2008-04-28 21:01 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Linus Torvalds, Mathieu Desnoyers, Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec Jeremy Fitzhardinge wrote: > Ingo Molnar wrote: >> And once we accept the static markers, we might as well make them as >> cheap as possible. > > Sure, so long as you take "as cheap as possible" to mean cheap in both > implementation complexity as well as runtime cost. > > I don't have any specific objections to any of the stuff that Mathieu is > working on, but it does worry me that each time a problem is addressed > it ends up being an even more subtle piece of code. I just haven't seen > enough concrete justification to make me feel comfortable with it all. > > It seems to me that a relatively simple implementation which allows the > desired tracing/marking functionality is the first step. If that proves > to cause a significant performance deficit then enabled then we can work > out how to address it in due course. But doing it all at once before > merging anything seems like overkill, particularly when we're talking > about specifics of gcc's codegen patterns, disassembling code fragments, > etc. > I really feel that the latest information that has come up has indicated that things are really not what they should be. They are in line, have a substantial probe cost, and we're messing around with how to jump around them. That's not the problem. I maintain what I said before: a call instruction (which defaults to a NOP), and then extract the state based on debugging info or assembler annotations. 
As far as patchable static jumps, I can see the utility of them, but I don't think this project is one of them. However, I believe the right way to do them is via compiler support. -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-28 21:01 ` H. Peter Anvin @ 2008-04-28 22:42 ` Mathieu Desnoyers 0 siblings, 0 replies; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-28 22:42 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Ingo Molnar, Linus Torvalds, Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec * H. Peter Anvin (hpa@zytor.com) wrote: > Jeremy Fitzhardinge wrote: >> Ingo Molnar wrote: >>> And once we accept the static markers, we might as well make them as >>> cheap as possible. >> Sure, so long as you take "as cheap as possible" to mean cheap in both >> implementation complexity as well as runtime cost. >> I don't have any specific objections to any of the stuff that Mathieu is >> working on, but it does worry me that each time a problem is addressed it >> ends up being an even more subtle piece of code. I just haven't seen >> enough concrete justification to make me feel comfortable with it all. >> It seems to me that a relatively simple implementation which allows the >> desired tracing/marking functionality is the first step. If that proves >> to cause a significant performance deficit then enabled then we can work >> out how to address it in due course. But doing it all at once before >> merging anything seems like overkill, particularly when we're talking >> about specifics of gcc's codegen patterns, disassembling code fragments, >> etc. > > I really feel that the latest information that has come up has indicated > that things are really not what they should be. They are in line, have a Do you consider all unlikely blocks to be in line ? If the real issue is to make sure they don't share cache lines with the body of the function, that could be arranged. However, I assume that using an unlikely branch to let gcc with -freorder-blocks put the instructions at the end of the function is enough. 
> substantial probe cost, When disabled : 0 cycles ? It additionally clobbers eax and the EFLAGS. For the parameters passed to the marker, I think the marker location should be chosen carefully so most of the variables would be live anyway even without a marker. > and we're messing around with how to jump around > them. I was perfectly happy with the immediate value + conditional branch, but apparently 0 cycles is more appealing than 2 :-) > > That's not the problem. > > I maintain what I said before: a call instruction (which defaults to a > NOP), and then extract the state based on debugging info or assembler > annotations. > Let's consider this option : First of all, I wouldn't like to require tracing users to get the kernel debuginfos each time they want to trace. I think it should be an "on" switch kind of infrastructure. Getting a few hundred MB worth of data isn't exactly that. If I get your idea right, you propose to use an inline assembly with "g" constraints to make sure gcc keeps them alive. I just did some testing of your approach applied to a marker in schedule() that shows that as soon as you need to dereference a pointer in the parameters, this adds operations in the fast path, which is not the case for markers because, as Ingo explained, this is done in a block outside the fast path. So your assembly constraint solution works fine only if the information happens to be there, in a register, at the inline assembly site. Then there is no added cost for register preparation. However, given it won't always be true, you have to bear the cost of setting up the registers from the stack or, worse, from a pointer read in the function fast path. The markers offload this to the jump target located outside of the fast path. Therefore, in the general case which includes parameters not present in the registers, markers seem like a more palatable solution. 
> As far as patchable static jumps, I can see the utility of them, but I > don't think this project is one of them. However, I believe the right way > to do them is via compiler support. > If you suppose the information is always live in registers at the instrumented site, then yes, I guess your constraint+call approach is good, modulo the fact that users will depend on hundreds of megabytes of debuginfo. However, in order to populate registers appropriately with a wider range of parameters without adding instructions to the fast path, markers, which add instructions in a cache-cold block, seem like a good way to go. And that depends on the ability to branch efficiently to that block, when enabled, in order to prepare the stack and do the call. Mathieu > -hpa -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:36 ` Linus Torvalds 2008-04-28 20:21 ` Ingo Molnar @ 2008-04-28 20:43 ` Mathieu Desnoyers 2008-04-28 21:02 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-28 20:43 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Fri, 25 Apr 2008, Mathieu Desnoyers wrote: > > > > DWARF2 is capable of extracting information only when not optimized away > > by the compiler. That's the whole point of markers : liveness is good in > > this case because we make sure the variable is there, not that it > > *might* be there. The latter case might be good enough for a debugger, > > but not for a production system tracer. > > This is why you really do want to recompile the function entirely if > you're debugging it. Because it might simply not be debuggable in its > normal state. > > I'd much rather see something truly generic that doesn't need any > pre-inserted "markers" at all that disable optimizations, and that allows Markers, with immediate values, only clobber the eax register and the ZF. They do not restrain inlining nor loop unrolling. They also require gcc to leave the variables in which the marker is interested "live". Are you referring to other optimizations I wouldn't have thought of ? > just about anything. Including live system bug-fixes etc (imagine finding > a bug - and not at somethign that was previously already "marked" - and > just replacing the buggy function with a non-buggy one). > > Linus kprobes is already doing a good job at probing a live system without rebooting. 
Markers are best suited to export information about kernel events which stays stable between releases so the information is readily available in the kernel, with low overhead even when tracing is enabled (which kprobes doesn't provide) which allows a user to flip the "on" switch and get a trace of all system calls, scheduling, traps, interrupts that happen on the system. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-28 20:43 ` Mathieu Desnoyers @ 2008-04-28 21:02 ` Jeremy Fitzhardinge 2008-05-04 15:03 ` Mathieu Desnoyers 0 siblings, 1 reply; 183+ messages in thread From: Jeremy Fitzhardinge @ 2008-04-28 21:02 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec Mathieu Desnoyers wrote: > Markers, with immediate values, only clobbers the eax register and the > ZF. It does not restrain inlining nor loop unrolling. It also requires > gcc to leave the variables in which the marker is interested "live". That in itself is pretty significant. If that value would otherwise be constant folded or strength-reduced away, you're putting a big limitation on what the compiler can do. The mere fact that it's necessary to do something to preserve many values shows how much the compiler transforms the code away from what's in the source, and specifically referencing otherwise unused intermediates inhibits that. In other words, if you weren't preventing optimisations, you wouldn't need to preserve values as much, because the optimiser wouldn't be getting rid of them. If you need to preserve lots of values, you're necessarily preventing the optimiser from doing its job. J ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-28 21:02 ` Jeremy Fitzhardinge @ 2008-05-04 15:03 ` Mathieu Desnoyers 2008-05-04 16:18 ` H. Peter Anvin 0 siblings, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-05-04 15:03 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec * Jeremy Fitzhardinge (jeremy@goop.org) wrote: > Mathieu Desnoyers wrote: >> Markers, with immediate values, only clobbers the eax register and the >> ZF. It does not restrain inlining nor loop unrolling. It also requires >> gcc to leave the variables in which the marker is interested "live". > > That in itself is pretty significant. If that value would otherwise be > constant folded or strength-reduced away, you're putting a big limitation > on what the compiler can do. The mere fact that it's necessary to do > something to preserve many values shows how much the compiler transforms > the code away from what's in the source, and specifically referencing > otherwise unused intermediates inhibits that. > > In other words, if you weren't preventing optimisations, you wouldn't need > to preserve values as much, because the optimiser wouldn't be getting rid > of them. If you need to preserve lots of values, you're necessarily > preventing the optimiser from doing its job. > > J I am not saying that the standard marker will have to inhibit optimizations. Actually, it's the contrary : a well-thought marker should _not_ modify that kind of optimization, and we should put markers in code locations less likely to inhibit gcc optimizations. However, in the case where we happen to be interested in information otherwise optimized away by GCC, it makes sense to inhibit this optimization in order to have the information available for tracing. 
I expect this to happen rarely, but I think we must deal with optimizations to make sure we never trace garbage due to some unexpected gcc optimization. I think it's a small (e.g. undetectable at the macrobenchmark level) price to pay to get correct tracing information. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-05-04 15:03 ` Mathieu Desnoyers @ 2008-05-04 16:18 ` H. Peter Anvin 0 siblings, 0 replies; 183+ messages in thread From: H. Peter Anvin @ 2008-05-04 16:18 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jeremy Fitzhardinge, Linus Torvalds, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec Mathieu Desnoyers wrote: > > I am not saying that the standard marker will have to inhibit > optimizations. Actually, it's the contrary : a well-thought marker > should _not_ modify that kind of optimization, and we should put markers > in code locations less likely to inhibit gcc optimizations. However, in > the case where we happen to be interested in information otherwise > optimized away by GCC, it makes sense to inhibit this optimization in > order to have the information available for tracing. > > I expect this to happen rarely, but I think we must deal with > optimizations to make sure we never trace garbage due to some unexpected > gcc optimization. I think it's a small (e.g. undetectable at the > macrobenchmark level) price to pay to get correct tracing information. > That's a pretty flippant reply... liveness causes register pressure which can cause rapid degradation in code quality on a register-starved architecture like x86. -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:30 ` Mathieu Desnoyers 2008-04-25 22:36 ` Linus Torvalds @ 2008-04-25 22:38 ` H. Peter Anvin 1 sibling, 0 replies; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 22:38 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Mathieu Desnoyers wrote: > > DWARF2 is capable of extracting information only when not optimized away > by the compiler. That's the whole point of markers : liveness is good in > this case because we make sure the variable is there, not that it > *might* be there. The latter case might be good enough for a debugger, > but not for a production system tracer. > That's what I address with the last paragraph of the email. > > The builtin expect will take care to put the instructions out of the > hot paths and therefore leave them out of the icache with gcc > -freorder-blocks (in -O2). The only addition to the frequently used > icache is, in this case, the 5 bytes jump, 2 bytes mov, 2 bytes test and > 2 (or 6) bytes conditional branch, for a total of 11 bytes for small > functions and 15 bytes for functions which require near jumps. > >> Now, if a breakpoint is too expensive, one can do exactly the same trick >> with a naked call instruction, with a higher icache impact in the unused >> case (five bytes instead of one or two). However, the key to low impact is >> to use the debugging information to recover state. > > The runtime cost of function call is bigger than the jump. I don't see > what this buys us. You get zero instructions and five bytes of NOP in the non-taken case. In the taken case, you move the whole thing out of line. >> (Liveness at the probe point is still possible to enforce with this >> technique: give gcc a "g" read constraint as part of the probe instruction. 
>> That makes gcc ensure the information is *somewhere*. The debugging >> information will tell you where to pick it up from. Obviously, any time >> liveness is enforced you suffer a potential cost.) > > It could be possible to do so. However, passing a variable argument list > to a marker is rather more flexible than those inline assembly > constraints. And you are still tied to the variable names and offer no > abstraction between the kernel implementation and the conceptual name > associated to a traced variable. "Rather more flexible?" Surely you're joking, Mr. Feynman? There is no difference, none, nada. Furthermore, your capture stub compiler, or trace data extractor, can do any kind of mapping it pleases; so I'm utterly confused what you're talking about "still tied to variable names." -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
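[Editorial sketch, not part of the original message.] The "g" read constraint mentioned above can be illustrated with GCC extended asm. This is a minimal sketch under stated assumptions: the macro name PROBE_KEEP_LIVE and the function sum_with_probe are hypothetical, and the asm template is left empty, whereas the proposal in the thread would place the probe instruction (for example a NOP to be patched) in the template. The "g" input operand only forces the compiler to keep the value in a register, in memory, or as an immediate at that point.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from any patch in this thread.
 * An asm statement with a "g" input operand makes the compiler keep
 * `value` available (register, memory or immediate) at this point, so
 * debug information can describe where to find it.
 */
#define PROBE_KEEP_LIVE(value) __asm__ __volatile__("" : : "g"(value))

/* Sums 0..n-1 and keeps the running total live at every iteration. */
static long sum_with_probe(int n)
{
    long total = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        total += i;
        PROBE_KEEP_LIVE(total); /* probe point: `total` must exist somewhere */
    }
    return total;
}
```

As the message notes, enforcing liveness this way is not free: the operand occupies a register or stack slot that the compiler might otherwise have reused.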
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 21:12 ` Mathieu Desnoyers 2008-04-25 21:15 ` H. Peter Anvin @ 2008-04-25 22:04 ` Linus Torvalds 2008-04-25 23:00 ` Mathieu Desnoyers ` (2 more replies) 2008-04-26 6:50 ` Jeremy Fitzhardinge 2 siblings, 3 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 22:04 UTC (permalink / raw) To: Mathieu Desnoyers Cc: H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Mathieu Desnoyers wrote: > > The point is to provide a way to dynamically enable code at runtime > without noticeable performance impact on the system. Quite frankly, maybe I'm a bit dense, but why don't you just recompile the whole original function (at run-time), load that new version of a function as a mini-module, and then insert a marker at the top of the old function that just does a "jmp replacementfunction". That has _zero_ cost for the non-marker case, and allows you to do pretty much any arbitrary code changes for the marker case. It's also a much simpler replacement. Yeah, that "jmp replacementfunction" is five or more bytes, but you can trivially do the actual _replacement_ write by writing it first as a single-byte debug trap, and after that has been written, write the target address after it, and then write the first byte of the "jmp" instruction last. In the (very unlikely) case that another CPU hits that debug trap, you just fix it up in the debug handler - you only need a single datum of "this is where that debug trap should relocate", because you simply create a trivial spinlock around the code-sequence that does the instruction rewrite. When undoing it, just do the same thing in reverse. 
Yeah, this requires you to basically recompile some function snippet when you insert a probe, but if that scares people, you could basically do it using the old code and inserting the markers and "relinking" it - avoiding the C compiler, and just basically have an "assembly recompiler". And yeah, maybe you want to do without the use of modules, and you'd just have a memory area that is kept free for these kinds of code replacement issues. And you can optimize it to not recompile the whole function, but do it on a finer granularity if you want. And sure, you want to really make sure that there is security in place so that this isn't used for rootkits, but isn't that true of pretty much *any* trace facility? Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
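[Editorial sketch, not part of the original message.] The three-step write order described in this message (trap byte first, then the target, then the jmp opcode last) can be modelled on a plain byte buffer. This is only a user-space simulation under stated assumptions: the names patch_jmp_in_order and jmp_rel32 are hypothetical, the buffer stands in for the first five bytes of the old function, and the debug-trap handler, the spinlock, and any cross-CPU synchronisation that real kernel code would need are not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OPCODE_INT3      0xCC /* x86 single-byte breakpoint */
#define OPCODE_JMP_REL32 0xE9 /* x86 near jump with 32-bit displacement */
#define JMP_LEN          5    /* one opcode byte plus four displacement bytes */

/* The displacement is relative to the end of the 5-byte jmp instruction. */
static int32_t jmp_rel32(uint64_t site_addr, uint64_t target_addr)
{
    return (int32_t)((int64_t)target_addr - (int64_t)(site_addr + JMP_LEN));
}

/* Rewrites `site` into "jmp rel32" using the ordering from the message. */
static void patch_jmp_in_order(uint8_t *site, int32_t rel32)
{
    assert(site != NULL);

    /* Step 1: trap byte first, so no CPU can execute a half-written jmp. */
    site[0] = OPCODE_INT3;

    /* Step 2: write the 4-byte displacement behind the trap. */
    memcpy(&site[1], &rel32, sizeof(rel32));

    /* Step 3: write the jmp opcode last; the new instruction is now complete. */
    site[0] = OPCODE_JMP_REL32;
}
```

Undoing the patch would follow the same order with the original bytes: trap first, then bytes 1 to 4 of the original code, then the original first byte.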
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:04 ` Linus Torvalds @ 2008-04-25 23:00 ` Mathieu Desnoyers 2008-04-25 23:13 ` Jeremy Fitzhardinge 2008-04-26 2:12 ` Frank Ch. Eigler 2008-06-05 17:44 ` Frank Ch. Eigler 2 siblings, 1 reply; 183+ messages in thread From: Mathieu Desnoyers @ 2008-04-25 23:00 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge, Frank Ch. Eigler, systemtap * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Fri, 25 Apr 2008, Mathieu Desnoyers wrote: > > > > The point is to provide a way to dynamically enable code at runtime > > without noticeable performance impact on the system. > > Quite frankly, maybe I'm a bit dense, but why don't you just recompile the > whole original function (at run-time), load that new version of a function > as a mini-module, and then insert a marker at the top of the old function > that just does a "jmp replacementfunction". > > That has _zero_ cost for the non-marker case, and allows you to do pretty > much any arbitrary code changes for the marker case. > > It's also a much simpler replacement. > This idea has been considered a few years ago at OLS in the tracing BOF if I remember well. The results were this : First, there is no way to guarantee that no code path, nor any return address from any function, interrupt, sleeping thread, will return to the "old" version of the function. Nor is it possible to determine when a quiescent state is reached. Therefore, we couldn't see how we can do the teardown. The second point is dependency between execution flow and variables. If we don't do a complete copy of the variables (which I don't see how we can do atomically), we will have to share the variables between the old and the new copies of the functions. 
However, some variables might encode information about the execution flow of the program and depend on the actual address at which the code is linked (function pointers for instance). Stuff like "goto *addr" would also break. > Yeah, that "jmp replacementfunction" is five or more bytes, but you can > trivially do the actual _replacement_ write by writing it first as a > single-byte debug trap, and after that has been written, write the target > address after it, and then write the first byte of the "jmp" instruction > last. In the (very unlikely) case that another CPU hits that debug trap, > you just fix it up in the debug handler - you only need a single datum of > "this is where that debug trap should relocate", because you simply create > a trivial spinlock around the code-sequence that does the instruction > rewrite. > That's actually what I do in my immediate values implementation. > When undoing it, just do the same thing in reverse. > > Yeah, this requires you to basically recompile some function snippet when > you insert a probe, but if that scares people, you could basically do it > using the old code and inserting the markers and "relinking" it - avoiding > the C compiler, and just basically have an "assembly recompiler". > > And yeah, maybe you want to do without the use of modules, and you'd just > have a memory area that is kept free for these kinds of code replacement > issues. And you can optimize it to not recompile the whole function, but > do it on a finer granularity if you want. > Then dealing with multiple code patching infrastructures (kprobes, alternatives, paravirt) would become hellish. If a kprobe is planted in the original version of the function, we have to insert it in the new version... and the teardown of the old function is still a problem. > And sure, you want to really make sure that there is security in place so > that this isn't used for rootkits, but isn't that true of pretty much > *any* trace facility? > Yep. 
The discussion I refer to took place at OLS a few years ago. Other participants might remember some other details I forgot. Mathieu > Linus -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 23:00 ` Mathieu Desnoyers @ 2008-04-25 23:13 ` Jeremy Fitzhardinge 2008-04-25 23:34 ` Masami Hiramatsu 0 siblings, 1 reply; 183+ messages in thread From: Jeremy Fitzhardinge @ 2008-04-25 23:13 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Frank Ch. Eigler, systemtap Mathieu Desnoyers wrote: > This idea has been considered a few years ago at OLS in the tracing BOF > if I remember well. The results were this : First, there is no way to > guarantee that no code path, nor any return address from any function, > interrupt, sleeping thread, will return to the "old" version of the > function. Nor is it possible to determine when a quiescent state is > reached. Therefore, we couldn't see how we can do the teardown. > Does that matter? The new function is semantically identical to the old one, and the old code will remain in place. If there's still users in the old function it may take a while for them to get flushed out (and won't be traced in the meantime), but you have to expect some missed events if you're shoving any kind of dynamic marker into the code. The main problem is if there's something still depending on the first 5 bytes of the function (most likely if there's a loop head somewhere near the top of the function). Updating the markers would mean you'd leave a trail of old versions hanging around as modules, but that's not a huge cost... > The second point is dependency between execution flow and variables. If > we don't do a complete copy of the variables (which I don't see how we > can do atomically), we will have to share the variables between the old > and the new copies of the functions. 
However, some variables might > encode information about the execution flow of the program and depend on > the actual address at which the code is linked (function pointers for > instance). Stuff like "goto *addr" would also break. > Obviously you'd only pick up new callers of the function, which would mean that they'd pick up the new versions of those function-local things. Though you'd need to make sure that the new versions of the function are using the old version's static variables... > > Then dealing with multiple code patching infrastructures (kprobes, > alternatives, paravirt) would become hellish. If a kprobe is planted in > the original version of the function, we have to insert it in the new > version... and the teardown of the old function is still a problem. > The module machinery already deals with patching paravirt and alternatives into loaded modules. Your bespoke module would get dealt with like any other module. J ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 23:13 ` Jeremy Fitzhardinge @ 2008-04-25 23:34 ` Masami Hiramatsu 2008-04-26 6:21 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 183+ messages in thread From: Masami Hiramatsu @ 2008-04-25 23:34 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mathieu Desnoyers, Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Frank Ch. Eigler, systemtap Jeremy Fitzhardinge wrote: > Mathieu Desnoyers wrote: >> This idea has been considered a few years ago at OLS in the tracing BOF >> if I remember well. The results were this : First, there is no way to >> guarantee that no code path, nor any return address from any function, >> interrupt, sleeping thread, will return to the "old" version of the >> function. Nor is it possible to determine when a quiescent state is >> reached. Therefore, we couldn't see how we can do the teardown. >> > > Does that matter? The new function is semantically identical to the old > one, and the old code will remain in place. If there's still users in > the old function it may take a while for them to get flushed out (and > won't be traced in the meantime), but you have to expect some missed > events if you're shoving any kind of dynamic marker into the code. The > main problem is if there's something still depending on the first 5 > bytes of the function (most likely if there's a loop head somewhere near > the top of the function). I think we have to ensure no threads sleeping or being interrupted on the function when removing new function. How would you check it? -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 23:34 ` Masami Hiramatsu @ 2008-04-26 6:21 ` Jeremy Fitzhardinge 2008-04-26 11:56 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 183+ messages in thread From: Jeremy Fitzhardinge @ 2008-04-26 6:21 UTC (permalink / raw) To: Masami Hiramatsu Cc: Mathieu Desnoyers, Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Frank Ch. Eigler, systemtap Masami Hiramatsu wrote: > Jeremy Fitzhardinge wrote: > >> Mathieu Desnoyers wrote: >> >>> This idea has been considered a few years ago at OLS in the tracing BOF >>> if I remember well. The results were this : First, there is no way to >>> guarantee that no code path, nor any return address from any function, >>> interrupt, sleeping thread, will return to the "old" version of the >>> function. Nor is it possible to determine when a quiescent state is >>> reached. Therefore, we couldn't see how we can do the teardown. >>> >>> >> Does that matter? The new function is semantically identical to the old >> one, and the old code will remain in place. If there's still users in >> the old function it may take a while for them to get flushed out (and >> won't be traced in the meantime), but you have to expect some missed >> events if you're shoving any kind of dynamic marker into the code. The >> main problem is if there's something still depending on the first 5 >> bytes of the function (most likely if there's a loop head somewhere near >> the top of the function). >> > > I think we have to ensure no threads sleeping or being interrupted on > the function when removing new function. How would you check it? > Not sure I follow you. You'd never remove any code. But you'd only start tracing new callers of the function. If the function loops indefinitely, you could potentially have some users which never end up getting traced. 
Also, if those users depend on instructions in the first 5 bytes of the function, they would crash because of the jump to the new function patched on top of them. Overall, it doesn't seem like a very satisfactory mechanism... J ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-26 6:21 ` Jeremy Fitzhardinge @ 2008-04-26 11:56 ` Arnaldo Carvalho de Melo 2008-04-26 23:38 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 183+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-04-26 11:56 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Frank Ch. Eigler, systemtap On Fri, Apr 25, 2008 at 11:21:54PM -0700, Jeremy Fitzhardinge wrote: > Masami Hiramatsu wrote: >> Jeremy Fitzhardinge wrote: >> >>> Mathieu Desnoyers wrote: >>> >>>> This idea has been considered a few years ago at OLS in the tracing BOF >>>> if I remember well. The results were this : First, there is no way to >>>> guarantee that no code path, nor any return address from any function, >>>> interrupt, sleeping thread, will return to the "old" version of the >>>> function. Nor is it possible to determine when a quiescent state is >>>> reached. Therefore, we couldn't see how we can do the teardown. >>>> >>> Does that matter? The new function is semantically identical to the old >>> one, and the old code will remain in place. If there's still users in >>> the old function it may take a while for them to get flushed out (and >>> won't be traced in the meantime), but you have to expect some missed >>> events if you're shoving any kind of dynamic marker into the code. The >>> main problem is if there's something still depending on the first 5 bytes >>> of the function (most likely if there's a loop head somewhere near the >>> top of the function). >>> >> >> I think we have to ensure no threads sleeping or being interrupted on >> the function when removing new function. How would you check it? >> > > Not sure I follow you. You'd never remove any code. But you'd only start You do, when you decide to stop tracing. 
He is not talking about the old function, that one, indeed will always be there, but what about the new one? When tracing stops we want to remove it and revert to using the old one... But perhaps you are suggesting that the new one, once loaded, stays there forever, that would work, but after several tracing sessions one would have to eventually reboot the machine due to many modules left loaded. - Arnaldo ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-26 11:56 ` Arnaldo Carvalho de Melo @ 2008-04-26 23:38 ` Jeremy Fitzhardinge 2008-04-27 1:00 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 183+ messages in thread From: Jeremy Fitzhardinge @ 2008-04-26 23:38 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Jeremy Fitzhardinge, Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Frank Ch. Eigler, systemtap Arnaldo Carvalho de Melo wrote: > You do, when you decide to stop tracing. He is not talking about the old > function, that one, indeed will always be there, but what about the new > one? When tracing stops we want to remove it and revert to using the old > one... > > But perhaps you are suggesting that the new one, once loaded, stays > there forever, that would work, but after several tracing sessions one > would have to eventually reboot the machine due to many modules left > loaded. As I said, it doesn't seem like a very satisfactory way to solve the problem. J ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-26 23:38 ` Jeremy Fitzhardinge @ 2008-04-27 1:00 ` Arnaldo Carvalho de Melo 0 siblings, 0 replies; 183+ messages in thread From: Arnaldo Carvalho de Melo @ 2008-04-27 1:00 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arnaldo Carvalho de Melo, Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Frank Ch. Eigler, systemtap On Sat, Apr 26, 2008 at 04:38:32PM -0700, Jeremy Fitzhardinge wrote: > Arnaldo Carvalho de Melo wrote: >> You do, when you decide to stop tracing. He is not talking about the old >> function, that one, indeed will always be there, but what about the new >> one? When tracing stops we want to remove it and revert to using the old >> one... >> >> But perhaps you are suggesting that the new one, once loaded, stays >> there forever, that would work, but after several tracing sessions one >> would have to eventually reboot the machine due to many modules left >> loaded. > > As I said, it doesn't seem like a very satisfactory way to solve the > problem. Indeed :-) - Arnaldo ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:04 ` Linus Torvalds 2008-04-25 23:00 ` Mathieu Desnoyers @ 2008-04-26 2:12 ` Frank Ch. Eigler 2008-06-05 17:44 ` Frank Ch. Eigler 2 siblings, 0 replies; 183+ messages in thread From: Frank Ch. Eigler @ 2008-04-26 2:12 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge Linus Torvalds <torvalds@linux-foundation.org> writes: > [...] >> The point is to provide a way to dynamically enable code at runtime >> without noticeable performance impact on the system. > > Quite frankly, maybe I'm a bit dense, but why don't you just recompile the > whole original function (at run-time), load that new version of a function > as a mini-module, and then insert a marker at the top of the old function > that just does a "jmp replacementfunction". [...] You mentioned possible solutions to some of the problems this ambitious an approach would cause. Here are a few more complications: - instrumenting inlined functions - proper sharing of static function data amongst multiple live copies of same function - unknown implications of violating long-standing assumptions about functions not changing addresses - interaction with other code modification machinery (kprobes, ...) - necessity to carry kernel sources & compilers on machines; slow marker activation - FChE ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 22:04 ` Linus Torvalds 2008-04-25 23:00 ` Mathieu Desnoyers 2008-04-26 2:12 ` Frank Ch. Eigler @ 2008-06-05 17:44 ` Frank Ch. Eigler 2 siblings, 0 replies; 183+ messages in thread From: Frank Ch. Eigler @ 2008-06-05 17:44 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, H. Peter Anvin, Andi Kleen, Ingo Molnar, peterz, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Linus wrote: > On Fri, 25 Apr 2008, Mathieu Desnoyers wrote: > > The point is to provide a way to dynamically enable code at runtime > > without noticeable performance impact on the system. > > Quite frankly, maybe I'm a bit dense, but why don't you just recompile the > whole original function (at run-time), load that new version of a function > as a mini-module, and then insert a marker at the top of the old function > that just does a "jmp replacementfunction". > > That has _zero_ cost for the non-marker case, and allows you to do pretty > much any arbitrary code changes for the marker case. > [...] > Yeah, this requires you to basically recompile some function snippet when > you insert a probe, but if that scares people, you could basically do it > using the old code and inserting the markers and "relinking" it - avoiding > the C compiler, and just basically have an "assembly recompiler". Linus, was it your intention to signal that you would veto any uses of the current trace_mark mechanism, and wait for this hypothetical function-recompilation-splicing widget as a replacement? This is how some people are interpreting this old thread. A number of problems with the new idea were brought up, and no one appears to have taken interest in trying to build it to see if they can be overcome or if there are more. 
On the other hand, a number of concerns with the markers have been dealt with since, such as performance numbers showing near-zero impact, and a variety of experience with the few dozen lttng markers and the tools that consume the data. The current debate appears to be stuck on fuzzier aesthetic issues. How are we to move forward? Do you see any *harm* in letting in the lttng markers soon? Could it be that once this "recompile function with instrumentation on the fly" machinery comes into existence eventually, then these exact same marker points could be reinterpreted as one particular potential instrumentation spot? (This could be something as simple as building the kernel with CONFIG_MARKERS=n by default so the markers are compiled out, then having selected alternative functions built with CONFIG_MARKERS=y.) - FChE ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 21:12 ` Mathieu Desnoyers 2008-04-25 21:15 ` H. Peter Anvin 2008-04-25 22:04 ` Linus Torvalds @ 2008-04-26 6:50 ` Jeremy Fitzhardinge 2008-04-28 0:49 ` Masami Hiramatsu 2 siblings, 1 reply; 183+ messages in thread From: Jeremy Fitzhardinge @ 2008-04-26 6:50 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec Mathieu Desnoyers wrote: > * Linus Torvalds (torvalds@linux-foundation.org) wrote: > >> On Fri, 25 Apr 2008, H. Peter Anvin wrote: >> >>> Yes, that should work. It's still ugly, and I have to say I find the >>> complexity rather distasteful. I am willing to be convinced it's worth it, >>> but I would really like to see hard numbers. >>> >> I really cannot imagine that this kind of pain is *ever* worth it. >> >> Please give an example of something so important that we'd want to do >> complex code rewriting on the fly. What _is_ the point of imv_cond()? >> >> Linus >> > > The point is to provide a way to dynamically enable code at runtime > without noticeable performance impact on the system. It's principally > useful to control the markers in the kernel, which can be placed in very > frequently executed code paths. The original markers add a memory read, > test and conditional branch at each marker site. By using the immediate > values patchset, it goes down to a load immediate value, test and branch. > > However, Ingo was still unhappy with the conditional branch, so I cooked > this jump patching optimization on top of the immediate values. I think all this demonstrates that the conditional branch is a bearable cost compared to the alternative. A conditional branch which almost always branches the same way is very predictable, and really shouldn't cost very much. J ^ permalink raw reply [flat|nested] 183+ messages in thread
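[Editorial sketch, not part of the original message.] The "memory read, test and conditional branch" form of a marker site discussed above can be sketched in plain C. This is an illustration under stated assumptions, not the kernel's marker implementation: struct demo_marker, demo_probe, and demo_hot_path are hypothetical names, and the unlikely() macro is defined locally on top of GCC's __builtin_expect, which hints that the branch is almost never taken so the probe call can be laid out off the hot path.

```c
#include <assert.h>
#include <stddef.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical marker state; not the kernel's struct marker. */
struct demo_marker {
    int enabled;               /* read from memory at every marker site */
    void (*probe)(long value); /* called only when the marker is enabled */
};

static long demo_last_traced = -1;

/* Hypothetical probe that just records the last value it saw. */
static void demo_probe(long value)
{
    demo_last_traced = value;
}

static struct demo_marker demo_sched_marker = { 0, demo_probe };

/* Hot path with one marker site: a load, a test and a rarely taken branch. */
static long demo_hot_path(long x)
{
    long result = x * 2;

    if (unlikely(demo_sched_marker.enabled)) {
        assert(demo_sched_marker.probe != NULL);
        demo_sched_marker.probe(result);
    }
    return result;
}
```

The immediate-values patchset discussed in the thread replaces the memory read of the enabled flag with an immediate operand patched into the instruction stream; that step is not shown here.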
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-26 6:50 ` Jeremy Fitzhardinge @ 2008-04-28 0:49 ` Masami Hiramatsu 0 siblings, 0 replies; 183+ messages in thread From: Masami Hiramatsu @ 2008-04-28 0:49 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mathieu Desnoyers, Linus Torvalds, H. Peter Anvin, Andi Kleen, Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec Jeremy Fitzhardinge wrote: > Mathieu Desnoyers wrote: >> * Linus Torvalds (torvalds@linux-foundation.org) wrote: >> >>> On Fri, 25 Apr 2008, H. Peter Anvin wrote: >>> >>>> Yes, that should work. It's still ugly, and I have to say I find the >>>> complexity rather distasteful. I am willing to be convinced it's worth it, >>>> but I would really like to see hard numbers. >>>> >>> I really cannot imagine that this kind of pain is *ever* worth it. >>> >>> Please give an example of something so important that we'd want to do >>> complex code rewriting on the fly. What _is_ the point of imv_cond()? >>> >>> Linus >>> >> The point is to provide a way to dynamically enable code at runtime >> without noticeable performance impact on the system. It's principally >> useful to control the markers in the kernel, which can be placed in very >> frequently executed code paths. The original markers add a memory read, >> test and conditional branch at each marker site. By using the immediate >> values patchset, it goes down to a load immediate value, test and branch. >> >> However, Ingo was still unhappy with the conditional branch, so I cooked >> this jump patching optimization on top of the immediate values. > > I think all this demonstrates that the conditional branch is a bearable > cost compared to the alternative. A conditional branch which almost > always branches the same way is very predictable, and really shouldn't > cost very much. I agree with you. 
When I measured the performance of a tracer (LKST) which used conditional branches, the overhead of the conditional branch itself was very small (less than 1%). Moreover, some benchmarks showed that the patched kernel became ~1% faster than before :-) (I guess that came from changes in the memory access pattern and timing.) I think that if someone is concerned about the actual performance impact, we'd better discuss the effects of the individual trace points, based on the actual results of some benchmarks. That way, we can improve it step by step. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 183+ messages in thread
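To make the cost being discussed concrete, here is a minimal userspace sketch of a marker site as described in the thread: a memory read, a test and a conditional branch in front of a rarely taken probe call. All names (`marker_enabled`, `marker_probe`, `hot_path`, `run_hot_path`) are hypothetical and are not the kernel's marker API; `__builtin_expect` is a GCC/Clang builtin and is assumed to be available.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical global switch; real markers keep per-site state. */
static volatile int marker_enabled;
/* Counts how many times the (hypothetical) probe was called. */
static uint64_t marker_hits;

/* Branch hint in the style of the kernel's unlikely(); GCC/Clang only. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static void marker_probe(void)
{
    marker_hits++;
}

/* Hot path with one marker: load the flag, test it, branch if set. */
static uint64_t hot_path(uint64_t x)
{
    if (unlikely(marker_enabled))
        marker_probe();
    return x * 2 + 1;
}

/* Runs the hot path for i = 0 .. iterations-1 and sums the results. */
static uint64_t run_hot_path(uint64_t iterations)
{
    uint64_t sum = 0;

    for (uint64_t i = 0; i < iterations; i++)
        sum += hot_path(i);
    return sum;
}
```

With the flag left at zero the branch always goes the same way, which is the well-predicted case Jeremy and Masami are talking about. Timing `run_hot_path()` with and without the `if` would be one way to get the per-site numbers asked for in the thread.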
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 20:41 ` H. Peter Anvin 2008-04-25 20:51 ` Linus Torvalds @ 2008-04-25 21:02 ` David Miller 2008-04-25 21:11 ` H. Peter Anvin 1 sibling, 1 reply; 183+ messages in thread From: David Miller @ 2008-04-25 21:02 UTC (permalink / raw) To: hpa Cc: mathieu.desnoyers, andi, torvalds, mingo, jirislaby, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, jeremy From: "H. Peter Anvin" <hpa@zytor.com> Date: Fri, 25 Apr 2008 13:41:00 -0700 > Yes, that should work. It's still ugly, and I have to say I find the > complexity rather distasteful. I am willing to be convinced it's worth > it, but I would really like to see hard numbers. This stuff would have been a lot easier if it just worked with normal relocations generated by the assembler, and that would work in such a straightforward way on EVERY architecture. The immediate instance generators could just use macros that architectures define, which are given a range of legal values for the immediate, and the macro emits the inline asm sequence that can support an immediate value of that range. Then we do a half-link of the kernel, collect the unresolved relocations generated by the immediate macros into a table which gets linked into the kernel, then resolve them in the final link all to zero or some defined initial value. Then it's just a matter of running through the relocation handling we already have for module loading when changing an immediate value. None of this crazy instruction parsing and branch following crap. I can't believe we're seriously considering this crud. :-/ ^ permalink raw reply [flat|nested] 183+ messages in thread
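The table-driven scheme described above can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: `struct imm_site` and `imm_update()` are hypothetical names, the "text" is an ordinary byte buffer rather than kernel code, operands are assumed to be 1, 2 or 4 bytes wide, and they are stored little-endian as on x86. It does not use the kernel's actual relocation handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the (hypothetical) table collected at link time. */
struct imm_site {
    size_t offset;  /* where the immediate operand lives in the text */
    size_t size;    /* operand width in bytes: 1, 2 or 4 */
};

/*
 * Store a new immediate value at every recorded site.
 * Returns 0 on success, -1 if any site is malformed or the value does
 * not fit the operand width; in that case nothing is modified.
 */
static int imm_update(unsigned char *text, size_t text_len,
                      const struct imm_site *sites, size_t nr_sites,
                      uint32_t value)
{
    /* First pass: validate every site before touching the text. */
    for (size_t i = 0; i < nr_sites; i++) {
        size_t size = sites[i].size;

        if (size != 1 && size != 2 && size != 4)
            return -1;
        if (sites[i].offset > text_len || size > text_len - sites[i].offset)
            return -1;
        if (size < 4 && (value >> (8 * size)) != 0)
            return -1;  /* value outside the legal range for this site */
    }

    /* Second pass: write the value little-endian into each operand. */
    for (size_t i = 0; i < nr_sites; i++)
        for (size_t b = 0; b < sites[i].size; b++)
            text[sites[i].offset + b] =
                (unsigned char)((value >> (8 * b)) & 0xff);

    return 0;
}
```

Validating the whole table before writing keeps the update all-or-nothing, which matches the idea that each site declares a range of legal values up front and no instruction parsing is needed at patch time.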
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 21:02 ` David Miller @ 2008-04-25 21:11 ` H. Peter Anvin 0 siblings, 0 replies; 183+ messages in thread From: H. Peter Anvin @ 2008-04-25 21:11 UTC (permalink / raw) To: David Miller Cc: mathieu.desnoyers, andi, torvalds, mingo, jirislaby, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, pageexec, jeremy David Miller wrote: > From: "H. Peter Anvin" <hpa@zytor.com> > Date: Fri, 25 Apr 2008 13:41:00 -0700 > >> Yes, that should work. It's still ugly, and I have to say I find the >> complexity rather distasteful. I am willing to be convinced it's worth >> it, but I would really like to see hard numbers. > > This stuff would have been a lot easier if it just worked with > normal relocations generated by the assembler, and that would > work in such a straightforward way on EVERY architecture. > > The immediate instance generators could just use macros that > architectures define, which are given a range of legal values for the > immediate, and the macro emits the inline asm sequence that can > support an immediate value of that range. > > Then we do a half-link of the kernel, collect the unresolved > relocations from generated by the immediate macros into a table which > gets linked into the kernel, then resolve them in the final link all > to zero or some defined initial value. > > Then it's just a matter of running through the relocation handling > we already have for module loading when changing an immediate > value. > > None of this crazy instruction parsing and branch following crap. > I can't believe we're seriously considering this crud. :-/ That's already there, for all practical purposes. The point of contention here is trying to go from immediate value rewriting to branch rewriting, which is probably the vast majority of all desired uses. However, branch rewriting affects the flow of control, and flow of control is inherently vital for gcc to understand. 
I'm not inherently opposed to branch target rewriting, but I believe gcc really needs to be involved in the process. On systems compiled with older compilers, we just won't use that feature -- similar to most other features introduced in a new compiler. -hpa ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:06 ` Linus Torvalds 2008-04-25 16:19 ` Andi Kleen @ 2008-04-25 16:22 ` Ingo Molnar 2008-04-25 16:37 ` Linus Torvalds 1 sibling, 1 reply; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 16:22 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > + spin_lock_irqsave(&poke_lock, flags); > + set_fixmap(FIX_POKE, phys); > + memcpy((void *)(virt + offset), opcode, len); > + spin_unlock_irqrestore(&poke_lock, flags); hm, right now we've got a debug protection in set_fixmap() to make sure it's only ever called once. So it's going to be a noisy bootup. (but it's a warning only) The patch below removes that. Ingo -------------> Subject: x86: remove set_fixmap() warning From: Ingo Molnar <mingo@elte.hu> Date: Fri Apr 25 18:05:57 CEST 2008 set_fixmap() is safe as long as it's explicitly serialized between all users. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/mm/init_64.c | 3 --- 1 file changed, 3 deletions(-) Index: linux/arch/x86/mm/init_64.c =================================================================== --- linux.orig/arch/x86/mm/init_64.c +++ linux/arch/x86/mm/init_64.c @@ -173,9 +173,6 @@ set_pte_phys(unsigned long vaddr, unsign new_pte = pfn_pte(phys >> PAGE_SHIFT, prot); pte = pte_offset_kernel(pmd, vaddr); - if (!pte_none(*pte) && - pte_val(*pte) != (pte_val(new_pte) & __supported_pte_mask)) - pte_ERROR(*pte); set_pte(pte, new_pte); /* ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:22 ` Ingo Molnar @ 2008-04-25 16:37 ` Linus Torvalds 2008-04-25 16:43 ` Ingo Molnar ` (3 more replies) 0 siblings, 4 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 16:37 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Ingo Molnar wrote: > > hm, right now we've got a debug protection in set_fixmap() to make sure > it's only ever called once. So it's going to be a noisy bootup. (but > it's a warning only) The patch below removes that. No, I think the warning is good, I should have done some kind of clear_fixmap() after doing the mmap. But there was actually a much worse problem with my patch: __set_fixmap() is __init. Which means that my patch was just totally broken. What I really wanted to do was to just follow the page tables and mark it writable temporarily over the whole loop, and get rid of the whole mess. (We'd need to make __set_fixmap() non-init, and probably return the pte_t pointer that it used, so that we could then just use "native_pte_clear()" on the thing after having done the memcpy()). I suspect I should have just kept using vmap(), even if I do dislike just how insanely expensive that likely is. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:37 ` Linus Torvalds @ 2008-04-25 16:43 ` Ingo Molnar 2008-04-25 16:45 ` Ingo Molnar ` (2 subsequent siblings) 3 siblings, 0 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 16:43 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 25 Apr 2008, Ingo Molnar wrote: > > > > hm, right now we've got a debug protection in set_fixmap() to make > > sure it's only ever called once. So it's going to be a noisy bootup. > > (but it's a warning only) The patch below removes that. > > No, I think the warning is good, I should have done some kind of > clear_fixmap() after doing the mmap. yeah - then you need the patch below that makes clear_fixmap() available on 64-bit as well. Ingo ---------------> Subject: x86: make clear_fixmap() available on 64-bit as well From: Ingo Molnar <mingo@elte.hu> Date: Fri Apr 25 18:25:25 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/asm-x86/fixmap.h | 8 ++++++++ include/asm-x86/fixmap_32.h | 7 ++----- include/asm-x86/fixmap_64.h | 4 ++-- 3 files changed, 12 insertions(+), 7 deletions(-) Index: linux/include/asm-x86/fixmap.h =================================================================== --- linux.orig/include/asm-x86/fixmap.h +++ linux/include/asm-x86/fixmap.h @@ -1,5 +1,13 @@ +#ifndef _ASM_FIXMAP_H +#define _ASM_FIXMAP_H + #ifdef CONFIG_X86_32 # include "fixmap_32.h" #else # include "fixmap_64.h" #endif + +#define clear_fixmap(idx) \ + __set_fixmap(idx, 0, __pgprot(0)) + +#endif Index: linux/include/asm-x86/fixmap_32.h =================================================================== --- linux.orig/include/asm-x86/fixmap_32.h +++ linux/include/asm-x86/fixmap_32.h @@ -10,8 +10,8 @@ * Support of BIGMEM added by 
Gerhard Wichert, Siemens AG, July 1999 */ -#ifndef _ASM_FIXMAP_H -#define _ASM_FIXMAP_H +#ifndef _ASM_FIXMAP_32_H +#define _ASM_FIXMAP_32_H /* used by vmalloc.c, vsyscall.lds.S. @@ -121,9 +121,6 @@ extern void reserve_top_address(unsigned #define set_fixmap_nocache(idx, phys) \ __set_fixmap(idx, phys, PAGE_KERNEL_NOCACHE) -#define clear_fixmap(idx) \ - __set_fixmap(idx, 0, __pgprot(0)) - #define FIXADDR_TOP ((unsigned long)__FIXADDR_TOP) #define __FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT) Index: linux/include/asm-x86/fixmap_64.h =================================================================== --- linux.orig/include/asm-x86/fixmap_64.h +++ linux/include/asm-x86/fixmap_64.h @@ -8,8 +8,8 @@ * Copyright (C) 1998 Ingo Molnar */ -#ifndef _ASM_FIXMAP_H -#define _ASM_FIXMAP_H +#ifndef _ASM_FIXMAP_64_H +#define _ASM_FIXMAP_64_H #include <linux/kernel.h> #include <asm/apicdef.h> ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:37 ` Linus Torvalds 2008-04-25 16:43 ` Ingo Molnar @ 2008-04-25 16:45 ` Ingo Molnar 2008-04-25 16:51 ` Linus Torvalds 2008-04-25 16:52 ` Ingo Molnar 2008-04-25 16:56 ` Andi Kleen 3 siblings, 1 reply; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 16:45 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > But there was actually a much worse problem with my patch: > __set_fixmap() is __init. Which means that my patch was just totally > broken. ah, on 64-bit. That we better make consistent anyway, via the patch below. set_pte_phys() needs to become non-init as well. Ingo -----------> Subject: x86: make __set_fixmap() non-init From: Ingo Molnar <mingo@elte.hu> Date: Fri Apr 25 18:28:21 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/mm/init_64.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) Index: linux/arch/x86/mm/init_64.c =================================================================== --- linux.orig/arch/x86/mm/init_64.c +++ linux/arch/x86/mm/init_64.c @@ -135,7 +135,7 @@ static __init void *spp_getpage(void) return ptr; } -static __init void +static void set_pte_phys(unsigned long vaddr, unsigned long phys, pgprot_t prot) { pgd_t *pgd; @@ -214,8 +214,7 @@ void __init cleanup_highmap(void) } /* NOTE: this is meant to be run only at boot */ -void __init -__set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot) +void __set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot) { unsigned long address = __fix_to_virt(idx); ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:45 ` Ingo Molnar @ 2008-04-25 16:51 ` Linus Torvalds 2008-04-25 17:02 ` Ingo Molnar 0 siblings, 1 reply; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 16:51 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Ingo Molnar wrote: > > ah, on 64-bit. That we better make consistent anyway, via the patch > below. set_pte_phys() needs to become non-init as well. Make it return the "pte_t *", and now you don't have to walk the page tables twice to just clear it immediately afterwards. At that point I think my patch will be happy and useful, but I also worry a bit whether it was worth the changes.. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 16:51 ` Linus Torvalds @ 2008-04-25 17:02 ` Ingo Molnar 2008-04-25 17:13 ` Linus Torvalds 0 siblings, 1 reply; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 17:02 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 25 Apr 2008, Ingo Molnar wrote: > > > > ah, on 64-bit. That we better make consistent anyway, via the patch > > below. set_pte_phys() needs to become non-init as well. > > Make it return the "pte_t *", and now you don't have to walk the page > tables twice to just clear it immediately afterwards. At that point I > think my patch will be happy and useful, but I also worry a bit > whether it was worth the changes.. performance i dont think we should be too worried about at this moment - this code is so rarely used that it should be driven by robustness i think. one theoretical worry i have is that we've got the pending immediate values changes from Mathieu. Those end up removing the original BUG_ON(len > sizeof(long)) restriction (and the alignment check) and uses a carefully crafted (but scary as hell) sequence of text_poke() sequences to turn a marker into a single-instruction NOP when the marker is inactive. Single-instruction NOP markers is a rather ... tempting goal and it can (and must be able to) patch instructions across page boundaries as well. i think with the PageReserved WARN_ON() we should be sufficiently protected against stray scribbles so Mathieu's fix might be usable as well - see it below. 
Note that the BUG_ON()s at the end of the text_poke() version below should have caught this bug too i think - because the bug was due to mis-mapping the pages due to the incorrect kernel_text_address() condition so we'd have noticed that the expected bits did not end up in the right place. Ingo -----------------------> Subject: Fix sched-devel text_poke From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Date: Thu, 24 Apr 2008 11:03:33 -0400 Use core_text_address() instead of kernel_text_address(). Deal with modules in the same way used for the core kernel. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/alternative.c | 38 ++++++++++++++++++-------------------- 1 file changed, 18 insertions(+), 20 deletions(-) Index: linux/arch/x86/kernel/alternative.c =================================================================== --- linux.orig/arch/x86/kernel/alternative.c +++ linux/arch/x86/kernel/alternative.c @@ -511,31 +511,29 @@ void *__kprobes text_poke(void *addr, co unsigned long flags; char *vaddr; int nr_pages = 2; + struct page *pages[2]; + int i; - BUG_ON(len > sizeof(long)); - BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1)) - - ((long)addr & ~(sizeof(long) - 1))); - if (kernel_text_address((unsigned long)addr)) { - struct page *pages[2] = { virt_to_page(addr), - virt_to_page(addr + PAGE_SIZE) }; - if (!pages[1]) - nr_pages = 1; - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); - BUG_ON(!vaddr); - local_irq_save(flags); - memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); - local_irq_restore(flags); - vunmap(vaddr); + if (!core_kernel_text((unsigned long)addr)) { + pages[0] = vmalloc_to_page(addr); + pages[1] = vmalloc_to_page(addr + PAGE_SIZE); } else { - /* - * modules are in vmalloc'ed memory, always writable. 
- */ - local_irq_save(flags); - memcpy(addr, opcode, len); - local_irq_restore(flags); + pages[0] = virt_to_page(addr); + pages[1] = virt_to_page(addr + PAGE_SIZE); } + BUG_ON(!pages[0]); + if (!pages[1]) + nr_pages = 1; + vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); + BUG_ON(!vaddr); + local_irq_save(flags); + memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); + local_irq_restore(flags); + vunmap(vaddr); sync_core(); /* Could also do a CLFLUSH here to speed up CPU recovery; but that causes hangs on some VIA CPUs. */ + for (i = 0; i < len; i++) + BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]); return addr; } ^ permalink raw reply [flat|nested] 183+ messages in thread
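The address arithmetic in the patch above (`addr & ~PAGE_MASK` for the offset into the temporary mapping, two pages mapped in case the write crosses a page boundary, and the byte-by-byte check after the copy) can be tried out in userspace. The sketch below is only an illustration: the `SKETCH_` constants assume 4 KiB pages, the helper names are hypothetical, and a plain `memcpy()` into a normal buffer stands in for the write through the vmap() alias.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 4 KiB pages, as on x86. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Offset of addr within its page, i.e. "addr & ~PAGE_MASK". */
static unsigned long page_offset_of(unsigned long addr)
{
    return addr & ~SKETCH_PAGE_MASK;
}

/* Number of pages touched by a write of len bytes at addr (len > 0). */
static int pages_spanned(unsigned long addr, size_t len)
{
    unsigned long first, last;

    assert(len > 0);
    first = addr & SKETCH_PAGE_MASK;
    last = (addr + len - 1) & SKETCH_PAGE_MASK;
    return (int)((last - first) / SKETCH_PAGE_SIZE) + 1;
}

/*
 * Copy opcode into text, then re-read text and compare, mirroring the
 * verification loop at the end of the patched text_poke().
 * Returns 1 if every byte landed where expected, 0 otherwise.
 */
static int poke_and_verify(unsigned char *text, const unsigned char *opcode,
                           size_t len)
{
    memcpy(text, opcode, len);
    for (size_t i = 0; i < len; i++)
        if (text[i] != opcode[i])
            return 0;
    return 1;
}
```

In the kernel version the copy goes through a second mapping of the same physical pages, so the final comparison through the original address is what would catch a mis-mapped page; here both views are the same buffer, so the check can only show the shape of the loop.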
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 17:02 ` Ingo Molnar @ 2008-04-25 17:13 ` Linus Torvalds 2008-04-25 17:26 ` Andi Kleen 2008-04-25 17:53 ` Ingo Molnar 0 siblings, 2 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 17:13 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Ingo Molnar wrote: > > performance i dont think we should be too worried about at this moment - > this code is so rarely used that it should be driven by robustness i > think. That really isn't true. This isn't done just once. It's done many thousands of times. I agree that it has to be robust, but if we want to make suspend/resume be instantaneous (and we do), performance does actually matter. Yes, this is probably much less of a problem than waiting for devices, and no, I haven't timed it, but if I counted right, we'll literally be going almost ten thousand of these calls over a suspend/resume cycle. That's not "rarely used". Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 17:13 ` Linus Torvalds @ 2008-04-25 17:26 ` Andi Kleen 2008-04-25 17:29 ` Linus Torvalds 2008-04-25 17:53 ` Ingo Molnar 1 sibling, 1 reply; 183+ messages in thread From: Andi Kleen @ 2008-04-25 17:26 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge > I agree that it has to be robust, but if we want to make suspend/resume > be instantaneous (and we do), performance does actually matter. Yes, this For suspend/resume we can actually just disable all the text_poke()s. They are not needed because we don't expect to stay in single CPU mode for long after wake up and they will just be undone again. I guess if it really was a problem (but really I haven't heard about it) the easiest fix would be to just extend system_state with SYSTEM_SUSPEND and then skip them if that is true. -Andi ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 17:26 ` Andi Kleen @ 2008-04-25 17:29 ` Linus Torvalds 0 siblings, 0 replies; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 17:29 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Andi Kleen wrote: > > For suspend/resume we can actually just disable all the text_poke()s. > They are not needed because we don't expect to stay in single CPU > mode for long after wake up and they will just be undone again. I do agree that we might decide to just not do this at all except for the actual physical bootup phase (which can use early_text_poke()). There may not be a whole lot of point to ever play with smp_alternatives() at any other time. > I guess if it really was a problem (but really I haven't heard about it) > the easiest fix would be to just extended system_state to SYSTEM_SUSPEND > and then skip them if that is true. Our device suspend right now takes about 3.5 seconds (that's using the debug thing, which adds a 5-second timeout). That *is* a problem, but it's historically been hidden by the fact that people are happy that suspend works at all when it does. These days, we're getting to the point (I think) that a lot more people are going to take suspend for granted. And I'd actually like to use it as a sleep state for desktop-like usage (let's face it, when the screen goes dark, the CPU should just go into suspend too if it's used as a desktop by non-technical users). And for that to be useful, it needs to come up quickly. Not add another second on top of the already-irritating delay of the screen waking up. Are we there yet? Hell no. But I don't think we're too far off. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 17:13 ` Linus Torvalds 2008-04-25 17:26 ` Andi Kleen @ 2008-04-25 17:53 ` Ingo Molnar 2008-04-25 18:04 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 17:53 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > performance i dont think we should be too worried about at this > > moment - this code is so rarely used that it should be driven by > > robustness i think. > > That really isn't true. This isn't done just once. It's done many > thousands of times. > > I agree that it has to be robust, but if we want to make > suspend/resume be instantaneous (and we do), performance does actually > matter. Yes, this is probably much less of a problem than waiting for > devices, and no, I haven't timed it, but if I counted right, we'll > literally be going almost ten thousand of these calls over a > suspend/resume cycle. > > That's not "rarely used". yeah, it's done 2800 times on my box with a distro .config. no strong feeling either way - but i dont think there's any cross-CPU TLB flush done in this case within vmap()/vunmap(). Why? Because when alternative_instructions() runs then we have just a single CPU in cpu_online_map. So i think it's only direct vmap()/vunmap() overhead, on a single CPU. We do a kmalloc/kfree which is rather fast - sub-microsecond. We install the pages in the pte's - this is rather fast as well - sub-microsecond. Even assuming cache-cold lines (which they are most of the time) and taken thousands of times that's at most a few milliseconds IMO. 
In fact, most of the actual vmap() related overhead should be well-cached (the kmalloc bits) - the main cost should come from trashing through all the instruction sites and modifying them. i just measured the actual costs, and the UP/SMP offline/online transition time (with Jiri's patch applied) is: # time echo 0 > /sys/devices/system/cpu/cpu1/online real 0m0.116s user 0m0.000s sys 0m0.008s # time echo 1 > /sys/devices/system/cpu/cpu1/online real 0m0.095s user 0m0.000s sys 0m0.069s with your fixmap patch: # time echo 0 > /sys/devices/system/cpu/cpu1/online real 0m0.110s user 0m0.001s sys 0m0.003s # time echo 1 > /sys/devices/system/cpu/cpu1/online real 0m0.099s user 0m0.000s sys 0m0.072s (i ran it multiple times and picked a representative run) i also did a third control run with a kernel that had alternative_instructions() disabled. The offline/online cost is: # time echo 0 > /sys/devices/system/cpu/cpu1/online real 0m0.108s user 0m0.000s sys 0m0.000s # time echo 1 > /sys/devices/system/cpu/cpu1/online real 0m0.096s user 0m0.000s sys 0m0.068s _perhaps_ there's a decrease in time but i couldnt say it for sure, because in the 'go online' case the numbers are so similar. In the go-offline case there seems to be a gradual decrease but that could be statistical noise. (The user/sys times are not reliable because most of this happens with irqs off, but the 'real' portion should be reliable.) Ingo ^ permalink raw reply [flat|nested] 183+ messages in thread
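The back-of-the-envelope estimate in the message above ("sub-microsecond" allocation plus "sub-microsecond" pte install, taken thousands of times, gives "at most a few milliseconds") can be written down as a small calculation. The per-call costs passed in below are assumed upper bounds of 1 microsecond each, chosen only to illustrate the arithmetic; they are not measurements, and `poke_overhead_ns()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper-bound estimate, in nanoseconds, of the mapping overhead for
 * nr_calls text_poke() calls, given an assumed per-call cost for the
 * allocation step and for installing the page table entries.
 */
static uint64_t poke_overhead_ns(uint64_t nr_calls, uint64_t alloc_ns,
                                 uint64_t map_ns)
{
    uint64_t per_call = alloc_ns + map_ns;

    assert(per_call >= alloc_ns);  /* guard against overflow in the sum */
    return nr_calls * per_call;
}
```

With the 2800 calls counted on Ingo's box and 1000 ns assumed for each step, `poke_overhead_ns(2800, 1000, 1000)` is 5,600,000 ns, or 5.6 ms. The same assumptions applied to Linus's figure of almost ten thousand calls give 20 ms, which is still small next to the roughly 100 ms offline/online times measured above.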
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 17:53 ` Ingo Molnar @ 2008-04-25 18:04 ` Ingo Molnar 2008-04-25 18:09 ` Linus Torvalds 2008-04-25 18:13 ` Ingo Molnar 2 siblings, 0 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 18:04 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Ingo Molnar <mingo@elte.hu> wrote: > yeah, it's done 2800 times on my box with a distro .config. > > no strong feeling either way - but i dont think there's any cross-CPU > TLB flush done in this case within vmap()/vunmap(). Why? Because when > alternative_instructions() runs then we have just a single CPU in > cpu_online_map. i mean, alternatives_smp_switch(). Ingo ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 17:53 ` Ingo Molnar 2008-04-25 18:04 ` Ingo Molnar @ 2008-04-25 18:09 ` Linus Torvalds 2008-04-25 18:19 ` Ingo Molnar 2008-04-25 18:13 ` Ingo Molnar 2 siblings, 1 reply; 183+ messages in thread From: Linus Torvalds @ 2008-04-25 18:09 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge On Fri, 25 Apr 2008, Ingo Molnar wrote: > > no strong feeling either way - but i dont think there's any cross-CPU > TLB flush done in this case within vmap()/vunmap(). Why? Because when > alternative_instructions() runs then we have just a single CPU in > cpu_online_map. Ok, fair enough. Without the IPI, I don't think there's a big deal. And you have the numbers to prove it. Consider me convinced. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [PATCH 1/1] x86: fix text_poke 2008-04-25 18:09 ` Linus Torvalds @ 2008-04-25 18:19 ` Ingo Molnar 2008-04-25 18:56 ` Ingo Molnar 0 siblings, 1 reply; 183+ messages in thread From: Ingo Molnar @ 2008-04-25 18:19 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 25 Apr 2008, Ingo Molnar wrote: > > > > no strong feeling either way - but i dont think there's any > > cross-CPU TLB flush done in this case within vmap()/vunmap(). Why? > > Because when alternative_instructions() runs then we have just a > > single CPU in cpu_online_map. > > Ok, fair enough. Without the IPI, I don't think there's a big deal. > And you have the numbers to prove it. Consider me convinced. great - i've lined up all the fixes into this git tree which you can pull from: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-x86-fixes4.git for-linus this has Jiri's fix followed by Mathieu's vmap logic cleanups, plus a bit of extra checks and the API extensions for set_fixmap (we didnt end up using them but they make sense nevertheless). Lightly tested though, so even if you agree with the changes you might want to wait an hour with the pull just in case some trivial build issue slipped in. Shortlog and diff below. 
Ingo ------------------> Ingo Molnar (4): x86: make clear_fixmap() available on 64-bit as well x86: make __set_fixmap() non-init x86: remove set_fixmap() warning x86: harden kernel code patching Jiri Slaby (1): x86: fix text_poke() Mathieu Desnoyers (1): x86: clean up text_poke() arch/x86/kernel/alternative.c | 39 +++++++++++++++++++-------------------- arch/x86/mm/init_64.c | 7 +++---- include/asm-x86/fixmap.h | 8 ++++++++ include/asm-x86/fixmap_32.h | 7 ++----- include/asm-x86/fixmap_64.h | 4 ++-- 5 files changed, 34 insertions(+), 31 deletions(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index df4099d..65c7857 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -511,31 +511,30 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len) unsigned long flags; char *vaddr; int nr_pages = 2; + struct page *pages[2]; + int i; - BUG_ON(len > sizeof(long)); - BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1)) - - ((long)addr & ~(sizeof(long) - 1))); - if (kernel_text_address((unsigned long)addr)) { - struct page *pages[2] = { virt_to_page(addr), - virt_to_page(addr + PAGE_SIZE) }; - if (!pages[1]) - nr_pages = 1; - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); - BUG_ON(!vaddr); - local_irq_save(flags); - memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); - local_irq_restore(flags); - vunmap(vaddr); + if (!core_kernel_text((unsigned long)addr)) { + pages[0] = vmalloc_to_page(addr); + pages[1] = vmalloc_to_page(addr + PAGE_SIZE); } else { - /* - * modules are in vmalloc'ed memory, always writable. 
- */ - local_irq_save(flags); - memcpy(addr, opcode, len); - local_irq_restore(flags); + pages[0] = virt_to_page(addr); + WARN_ON(!PageReserved(pages[0])); + pages[1] = virt_to_page(addr + PAGE_SIZE); } + BUG_ON(!pages[0]); + if (!pages[1]) + nr_pages = 1; + vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); + BUG_ON(!vaddr); + local_irq_save(flags); + memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); + local_irq_restore(flags); + vunmap(vaddr); sync_core(); /* Could also do a CLFLUSH here to speed up CPU recovery; but that causes hangs on some VIA CPUs. */ + for (i = 0; i < len; i++) + BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]); return addr; } diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 1ff7906..b798e7b 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -135,7 +135,7 @@ static __init void *spp_getpage(void) return ptr; } -static __init void +static void set_pte_phys(unsigned long vaddr, unsigned long phys, pgprot_t prot) { pgd_t *pgd; @@ -173,7 +173,7 @@ set_pte_phys(unsigned long vaddr, unsigned long phys, pgprot_t prot) new_pte = pfn_pte(phys >> PAGE_SHIFT, prot); pte = pte_offset_kernel(pmd, vaddr); - if (!pte_none(*pte) && + if (!pte_none(*pte) && pte_val(new_pte) && pte_val(*pte) != (pte_val(new_pte) & __supported_pte_mask)) pte_ERROR(*pte); set_pte(pte, new_pte); @@ -214,8 +214,7 @@ void __init cleanup_highmap(void) } /* NOTE: this is meant to be run only at boot */ -void __init -__set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot) +void __set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot) { unsigned long address = __fix_to_virt(idx); diff --git a/include/asm-x86/fixmap.h b/include/asm-x86/fixmap.h index 382eb27..5bd2069 100644 --- a/include/asm-x86/fixmap.h +++ b/include/asm-x86/fixmap.h @@ -1,5 +1,13 @@ +#ifndef _ASM_FIXMAP_H +#define _ASM_FIXMAP_H + #ifdef CONFIG_X86_32 # include "fixmap_32.h" #else # include "fixmap_64.h" #endif + +#define clear_fixmap(idx) 
\ + __set_fixmap(idx, 0, __pgprot(0)) + +#endif diff --git a/include/asm-x86/fixmap_32.h b/include/asm-x86/fixmap_32.h index eb16651..4b96148 100644 --- a/include/asm-x86/fixmap_32.h +++ b/include/asm-x86/fixmap_32.h @@ -10,8 +10,8 @@ * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999 */ -#ifndef _ASM_FIXMAP_H -#define _ASM_FIXMAP_H +#ifndef _ASM_FIXMAP_32_H +#define _ASM_FIXMAP_32_H /* used by vmalloc.c, vsyscall.lds.S. @@ -121,9 +121,6 @@ extern void reserve_top_address(unsigned long reserve); #define set_fixmap_nocache(idx, phys) \ __set_fixmap(idx, phys, PAGE_KERNEL_NOCACHE) -#define clear_fixmap(idx) \ - __set_fixmap(idx, 0, __pgprot(0)) - #define FIXADDR_TOP ((unsigned long)__FIXADDR_TOP) #define __FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT) diff --git a/include/asm-x86/fixmap_64.h b/include/asm-x86/fixmap_64.h index f3d7685..355d26a 100644 --- a/include/asm-x86/fixmap_64.h +++ b/include/asm-x86/fixmap_64.h @@ -8,8 +8,8 @@ * Copyright (C) 1998 Ingo Molnar */ -#ifndef _ASM_FIXMAP_H -#define _ASM_FIXMAP_H +#ifndef _ASM_FIXMAP_64_H +#define _ASM_FIXMAP_64_H #include <linux/kernel.h> #include <asm/apicdef.h> ^ permalink raw reply related [flat|nested] 183+ messages in thread
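The text_poke() in the diff above never stores through the kernel text address itself. It builds a temporary second mapping of the same pages with vmap(), writes at the in-page offset `addr & ~PAGE_MASK`, and then reads back through the original address to check the result. A rough user-space analogy, not kernel code: two mmap() views of one temporary file stand in for the read-only text mapping and the writable alias. The name `poke_via_alias` is made up for this sketch, and it handles a single page only.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write `len` bytes at `offset` through a writable alias of a page that is
 * also mapped read-only, then verify the bytes through the read-only view.
 * Returns 0 on success, -1 on any failure or if the write would leave the page.
 */
int poke_via_alias(size_t offset, const void *opcode, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	FILE *backing;
	char *text, *alias;
	int ret = -1;

	if (page <= 0 || len > (size_t)page || offset > (size_t)page - len)
		return -1;

	backing = tmpfile();		/* stands in for the physical page */
	if (!backing)
		return -1;
	if (ftruncate(fileno(backing), page) != 0)
		goto out_close;

	/* "Kernel text": a read-only view of the page. */
	text = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED,
		    fileno(backing), 0);
	if (text == MAP_FAILED)
		goto out_close;

	/* Second, writable view of the same page (the vmap() analogue). */
	alias = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
		     fileno(backing), 0);
	if (alias == MAP_FAILED)
		goto out_unmap_text;

	memcpy(alias + offset, opcode, len);

	/* Same idea as the final BUG_ON() loop: check via the original view. */
	if (memcmp(text + offset, opcode, len) == 0)
		ret = 0;

	munmap(alias, (size_t)page);
out_unmap_text:
	munmap(text, (size_t)page);
out_close:
	fclose(backing);
	return ret;
}
```

The kernel version maps two pages so that a write may straddle a page boundary; this sketch rejects such writes instead.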
* Re: [PATCH 1/1] x86: fix text_poke
From: Ingo Molnar @ 2008-04-25 18:56 UTC
To: Linus Torvalds
Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

* Ingo Molnar <mingo@elte.hu> wrote:

> > Ok, fair enough. Without the IPI, I don't think there's a big deal.
> > And you have the numbers to prove it. Consider me convinced.
>
> great - i've lined up all the fixes into this git tree which you can
> pull from:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-x86-fixes4.git for-linus
>
> this has Jiri's fix followed by Mathieu's vmap logic cleanups, plus a
> bit of extra checks and the API extensions for set_fixmap (we didnt
> end up using them but they make sense nevertheless).
>
> Lightly tested though, so even if you agree with the changes you might
> want to wait an hour with the pull just in case some trivial build
> issue slipped in. Shortlog and diff below.

ok, it's better tested now - 10 random bootups, amongst them 64/32-bit
allyesconfig bootups, and some targeted testing as well with
offlining/onlining CPUs and no problems found so far. Please pull.

	Ingo
* Re: [PATCH 1/1] x86: fix text_poke
From: Ingo Molnar @ 2008-04-25 18:13 UTC
To: Linus Torvalds
Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

* Ingo Molnar <mingo@elte.hu> wrote:

> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > > performance i dont think we should be too worried about at this
> > > moment - this code is so rarely used that it should be driven by
> > > robustness i think.
> >
> > That really isn't true. This isn't done just once. It's done many
> > thousands of times.
> >
> > I agree that it has to be robust, but if we want to make
> > suspend/resume be instantaneous (and we do), performance does
> > actually matter. Yes, this is probably much less of a problem than
> > waiting for devices, and no, I haven't timed it, but if I counted
> > right, we'll literally be going almost ten thousand of these calls
> > over a suspend/resume cycle.
> >
> > That's not "rarely used".
>
> yeah, it's done 2800 times on my box with a distro .config.
>
> no strong feeling either way - but i dont think there's any cross-CPU
> TLB flush done in this case within vmap()/vunmap(). Why? Because when
> alternative_instructions() runs then we have just a single CPU in
> cpu_online_map.
>
> So i think it's only direct vmap()/vunmap() overhead, on a single CPU.
> We do a kmalloc/kfree which is rather fast - sub-microsecond. We
> install the pages in the pte's - this is rather fast as well -
> sub-microsecond. Even assuming cache-cold lines (which they are most
> of the time) and taken thousands of times that's at most a few
> milliseconds IMO.
>
> In fact, most of the actual vmap() related overhead should be
> well-cached (the kmalloc bits) - the main cost should come from
> trashing through all the instruction sites and modifying them.

i just did some direct measurements of alternatives_smp_switch() itself:

   alternatives took: 7374 usecs
   alternatives took: 8775 usecs
   alternatives took: 7498 usecs
   alternatives took: 8776 usecs

that's on a ~2GHz Athlon64 X2 - so not the latest hw.

i also added a sysctl to turn alternatives patching on/off, and the CPU
offline+online cycle:

   # alternatives on:
   real 0m0.152s
   real 0m0.172s

   # alternatives off:
   real 0m0.146s
   real 0m0.168s

so it's measurable and it is in the few milliseconds range. (But there
seems to be strong dependency on the kernel image layout or some other
detail - compare these timings to my previous timings - they were
radically different.)

	Ingo
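The figures in this message can be turned into a rough per-call cost. This is a back-of-the-envelope sketch only: it assumes that the "2800 times" count and the four alternatives_smp_switch() timings describe the same patching pass on the same box, which the message does not state explicitly. The function name is hypothetical.

```c
#include <stddef.h>

/*
 * Mean cost of one patched site in microseconds, given the wall-clock
 * samples for whole patching runs and the number of sites per run.
 * Callers must pass nsamples > 0 and pokes_per_run > 0.
 */
double usecs_per_poke(const unsigned *samples_usecs, size_t nsamples,
		      unsigned pokes_per_run)
{
	double total = 0.0;

	for (size_t i = 0; i < nsamples; i++)
		total += samples_usecs[i];

	return total / (double)nsamples / (double)pokes_per_run;
}
```

With the samples {7374, 8775, 7498, 8776} and 2800 sites per run this comes out at roughly 2.9 microseconds per site, which is in line with the "few milliseconds" total estimated in the quoted text.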
* Re: [PATCH 1/1] x86: fix text_poke
From: Ingo Molnar @ 2008-04-25 16:52 UTC
To: Linus Torvalds
Cc: Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But there was actually a much worse problem with my patch:
> __set_fixmap() is __init. Which means that my patch was just totally
> broken.
>
> What I really wanted to do was to just follow the page tables and mark
> it writable temporarily over the whole loop, and get rid of the whole
> mess.
>
> (We'd need to make __set_fixmap() non-init, and probably return the
> pte_t pointer that it used, so that we could then just use
> "native_pte_clear()" on the thing after having done the memcpy()).
>
> I suspect I should have just kept using vmap(), even if I do dislike
> just how insanely expensive that likely is.

clear_fixmap() is OK. I've made a tree with all these fixlets, in the
proper order and with the commit logs tidied up:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-x86-fixes3.git for-linus

[ i integrated Jiri's commit to before your fix because he really
  deserves that commit (and more) for his relentless debugging effort. ]

below is the full shortlog and diff. Minimally tested on 64-bit so far.
Ingo ------------------> Ingo Molnar (3): x86: make clear_fixmap() available on 64-bit as well x86: make __set_fixmap() non-init x86: harden kernel code patching Jiri Slaby (1): x86: fix text_poke() Linus Torvalds (1): x86: clean up text_poke() arch/x86/kernel/alternative.c | 35 +++++++++++++++++++---------------- arch/x86/mm/init_64.c | 5 ++--- include/asm-x86/fixmap.h | 8 ++++++++ include/asm-x86/fixmap_32.h | 8 +++----- include/asm-x86/fixmap_64.h | 5 +++-- 5 files changed, 35 insertions(+), 26 deletions(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index df4099d..2e39830 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -508,24 +508,27 @@ void *text_poke_early(void *addr, const void *opcode, size_t len) */ void *__kprobes text_poke(void *addr, const void *opcode, size_t len) { - unsigned long flags; - char *vaddr; - int nr_pages = 2; + static DEFINE_SPINLOCK(poke_lock); + unsigned long flags, bits; + bits = (unsigned long) addr; BUG_ON(len > sizeof(long)); - BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1)) - - ((long)addr & ~(sizeof(long) - 1))); - if (kernel_text_address((unsigned long)addr)) { - struct page *pages[2] = { virt_to_page(addr), - virt_to_page(addr + PAGE_SIZE) }; - if (!pages[1]) - nr_pages = 1; - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); - BUG_ON(!vaddr); - local_irq_save(flags); - memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); - local_irq_restore(flags); - vunmap(vaddr); + BUG_ON(len & (len-1)); + BUG_ON(bits & (len-1)); + + if (core_kernel_text(bits)) { + unsigned long phys = __pa(addr); + unsigned long offset = phys & ~PAGE_MASK; + unsigned long virt = fix_to_virt(FIX_POKE); + phys &= PAGE_MASK; + + WARN_ON(!PageReserved(virt_to_page(addr))); + + spin_lock_irqsave(&poke_lock, flags); + set_fixmap(FIX_POKE, phys); + memcpy((void *)(virt + offset), opcode, len); + clear_fixmap(FIX_POKE); + spin_unlock_irqrestore(&poke_lock, flags); } else { /* * 
modules are in vmalloc'ed memory, always writable. diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 1ff7906..7a81dd0 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -135,7 +135,7 @@ static __init void *spp_getpage(void) return ptr; } -static __init void +static void set_pte_phys(unsigned long vaddr, unsigned long phys, pgprot_t prot) { pgd_t *pgd; @@ -214,8 +214,7 @@ void __init cleanup_highmap(void) } /* NOTE: this is meant to be run only at boot */ -void __init -__set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot) +void __set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot) { unsigned long address = __fix_to_virt(idx); diff --git a/include/asm-x86/fixmap.h b/include/asm-x86/fixmap.h index 382eb27..5bd2069 100644 --- a/include/asm-x86/fixmap.h +++ b/include/asm-x86/fixmap.h @@ -1,5 +1,13 @@ +#ifndef _ASM_FIXMAP_H +#define _ASM_FIXMAP_H + #ifdef CONFIG_X86_32 # include "fixmap_32.h" #else # include "fixmap_64.h" #endif + +#define clear_fixmap(idx) \ + __set_fixmap(idx, 0, __pgprot(0)) + +#endif diff --git a/include/asm-x86/fixmap_32.h b/include/asm-x86/fixmap_32.h index eb16651..e5db7d5 100644 --- a/include/asm-x86/fixmap_32.h +++ b/include/asm-x86/fixmap_32.h @@ -10,8 +10,8 @@ * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999 */ -#ifndef _ASM_FIXMAP_H -#define _ASM_FIXMAP_H +#ifndef _ASM_FIXMAP_32_H +#define _ASM_FIXMAP_32_H /* used by vmalloc.c, vsyscall.lds.S. 
@@ -55,6 +55,7 @@ enum fixed_addresses { FIX_HOLE, FIX_VDSO, FIX_DBGP_BASE, + FIX_POKE, FIX_EARLYCON_MEM_BASE, #ifdef CONFIG_X86_LOCAL_APIC FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */ @@ -121,9 +122,6 @@ extern void reserve_top_address(unsigned long reserve); #define set_fixmap_nocache(idx, phys) \ __set_fixmap(idx, phys, PAGE_KERNEL_NOCACHE) -#define clear_fixmap(idx) \ - __set_fixmap(idx, 0, __pgprot(0)) - #define FIXADDR_TOP ((unsigned long)__FIXADDR_TOP) #define __FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT) diff --git a/include/asm-x86/fixmap_64.h b/include/asm-x86/fixmap_64.h index f3d7685..ba80e6b 100644 --- a/include/asm-x86/fixmap_64.h +++ b/include/asm-x86/fixmap_64.h @@ -8,8 +8,8 @@ * Copyright (C) 1998 Ingo Molnar */ -#ifndef _ASM_FIXMAP_H -#define _ASM_FIXMAP_H +#ifndef _ASM_FIXMAP_64_H +#define _ASM_FIXMAP_64_H #include <linux/kernel.h> #include <asm/apicdef.h> @@ -37,6 +37,7 @@ enum fixed_addresses { VSYSCALL_FIRST_PAGE = VSYSCALL_LAST_PAGE + ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1, VSYSCALL_HPET, + FIX_POKE, FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, FIX_HPET_BASE, ^ permalink raw reply related [flat|nested] 183+ messages in thread
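The text_poke() hunk in this diff maps only a single page through FIX_POKE. That is enough because of the two checks `BUG_ON(len & (len-1))` and `BUG_ON(bits & (len-1))`: a power-of-two-sized, naturally aligned write cannot straddle a page boundary. A minimal user-space sketch of that property, assuming 4 KiB pages; the helper names are hypothetical and not kernel API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/*
 * Mirrors the two BUG_ON() checks from the hunk: len must be a power of
 * two and addr must be a multiple of len. Unlike the kernel checks, this
 * sketch also rejects len == 0.
 */
bool poke_is_naturally_aligned(uintptr_t addr, size_t len)
{
	if (len == 0 || (len & (len - 1)))
		return false;
	return (addr & (len - 1)) == 0;
}

/* True if the first and last byte of the write fall on different pages. */
bool poke_crosses_page(uintptr_t addr, size_t len)
{
	return (addr / SKETCH_PAGE_SIZE) !=
	       ((addr + len - 1) / SKETCH_PAGE_SIZE);
}
```

For example, a 2-byte write at address 4095 crosses into the next page, but it is also not naturally aligned, so the checks would reject it before the single-page mapping is ever used.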
* Re: [PATCH 1/1] x86: fix text_poke
From: Andi Kleen @ 2008-04-25 16:56 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Andi Kleen, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

> I suspect I should have just kept using vmap(), even if I do dislike just
> how insanely expensive that likely is.

If it's really a problem it would be better to just batch it and extract
it into a separate function. The larger scale callers of text_poke() are
loops, so you could just map it once before the loop and then unmap after.

But I haven't heard about anyone complaining about this.

-Andi
* Re: [PATCH 1/1] x86: fix text_poke
From: Ingo Molnar @ 2008-04-25 15:50 UTC
To: Linus Torvalds
Cc: Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> No. That whole code sequence is total and utter crap. It needs to be
> rewritten.
>
> It first does a BUG_ON() if it's not naturally aligned (because that
> wouldn't be atomic), and then it has code for page crossing! What a
> TOTAL PIECE OF SH*T!
>
> Hint:
>  - if it's naturally aligned, it couldn't be page crossing ANYWAY
>  - and if it was a page-crosser, it sure as hell couldn't be atomic!
>
> The code is just crap, crap, crap. It needs to be rewritten from
> scratch. I'll have a patch soonish.

yeah :(

it seems that this code only worked because text_poke_early() [which can
take arbitrary length and alignment] does most of the patching, it is
the real code-patching machinery that is used during early bootup - and
that's not used later on.

text_poke() itself only applies/unapplies the LOCK prefix - a single
byte. We shouldnt be doing that at all: the cost of LOCK is
insignificant (a few cycles) and most systems are SMP anyway.

any other type of code patching should use stop_machine_run(), where
every CPU is stopped with irqs disabled.

	Ingo
* Re: [PATCH 1/1] x86: fix text_poke
From: H. Peter Anvin @ 2008-04-25 15:57 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, Jeremy Fitzhardinge

Ingo Molnar wrote:
>
> text_poke() itself only applies/unapplies the LOCK prefix - a single
> byte. We shouldnt be doing that at all: the cost of LOCK is
> insignificant (a few cycles) and most systems are SMP anyway.
>

Alas, on older CPUs the cost of LOCK can be massive. The question is how
much we really care - the embedded people (who would definitely be
affected) will simply build UP kernels, and this only affects booting SMP
kernels on UP.

	-hpa
* Re: [PATCH 1/1] x86: fix text_poke
From: Pavel Machek @ 2008-04-25 18:53 UTC
To: H. Peter Anvin
Cc: Ingo Molnar, Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, Jeremy Fitzhardinge

On Fri 2008-04-25 08:57:20, H. Peter Anvin wrote:
> Ingo Molnar wrote:
>>
>> text_poke() itself only applies/unapplies the LOCK prefix - a single byte.
>> We shouldnt be doing that at all: the cost of LOCK is insignificant (a few
>> cycles) and most systems are SMP anyway.
>>
>
> Alas, on older CPUs the cost of LOCK can be massive. The question is how
> much we really care - the embedded people (who would definitely be
> affected) will simply build UP kernels, and this only affects booting SMP
> kernels on UP.

Like... say distros on older hardware?

	Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [PATCH 1/1] x86: fix text_poke
From: Linus Torvalds @ 2008-04-25 16:11 UTC
To: Ingo Molnar
Cc: Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

On Fri, 25 Apr 2008, Ingo Molnar wrote:
>
> text_poke() itself only applies/unapplies the LOCK prefix - a single
> byte. We shouldnt be doing that at all: the cost of LOCK is
> insignificant (a few cycles) and most systems are SMP anyway.

No, the cost of LOCK is quite high on a lot of systems.

On P4's in particular, since LOCK is serializing, it's about 140 cycles or
so (and breaks all speculation). So we definitely want to remove it for
any generic kernels.

(lock is fairly cheap on AMD K8's, and reportedly on Intel's upcoming
Nehalem too, but on Core 2 it's about 35 cycles - quite noticeable,
although not nearly the disaster that netburst is)

Oh, and text_poke() is also used for inserting the debug instruction for
kprobes (and restoring the original byte), but yes, that is always just a
single byte too.

		Linus
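The "applies/unapplies the LOCK prefix" loop discussed here can be pictured as a walk over a table of recorded prefix locations, with one single-byte poke per site. The following is a user-space sketch under stated assumptions, not the kernel's code: `sketch_poke()` is a plain store standing in for text_poke(), the site table is an ordinary pointer array, and a one-byte NOP (0x90) is assumed as the uniprocessor replacement byte. The value 0xf0 is the x86 LOCK prefix encoding.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCK_PREFIX_BYTE 0xf0	/* x86 LOCK prefix */
#define NOP_BYTE         0x90	/* assumption: one-byte NOP on UP */

/* Stand-in for text_poke(): here just a plain one-byte store. */
static void sketch_poke(uint8_t *addr, uint8_t byte)
{
	*addr = byte;
}

/*
 * Walk every recorded LOCK-prefix site and write either the prefix
 * (smp != 0) or the NOP replacement (smp == 0). One poke per site,
 * one byte per poke.
 */
void sketch_smp_switch(uint8_t *const *sites, size_t nsites, int smp)
{
	for (size_t i = 0; i < nsites; i++)
		sketch_poke(sites[i], smp ? LOCK_PREFIX_BYTE : NOP_BYTE);
}
```

Because every poke goes to an address taken from the table, any mistake in translating such an address to the page actually written would place a stray 0xf0 or NOP byte somewhere else in memory.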
* Re: [PATCH 1/1] x86: fix text_poke
From: Mathieu Desnoyers @ 2008-04-25 15:54 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
>
>
> On Fri, 25 Apr 2008, Ingo Molnar wrote:
> >
> > something like the patch below? (untested)
>
> No. That whole code sequence is total and utter crap. It needs to be
> rewritten.
>
> It first does a BUG_ON() if it's not naturally aligned (because that
> wouldn't be atomic), and then it has code for page crossing! What a TOTAL
> PIECE OF SH*T!
>
> Hint:
>  - if it's naturally aligned, it couldn't be page crossing ANYWAY
>  - and if it was a page-crosser, it sure as hell couldn't be atomic!
>
> The code is just crap, crap, crap. It needs to be rewritten from scratch.
> I'll have a patch soonish.
>
> 		Linus

Woooow, just a sec here. I removed the atomicity test _because_ there
happens to be a case where it's safe to do non-atomic instruction
modification. If we do:

1) replace the instruction first byte by a breakpoint, execute an
   instruction bypass (see the immediate values patches for detail)
2) modify the instruction non-atomically
3) put back the original instruction first byte.

That's why I removed the BUG_ONs at the beginning of the function.
That's also why it's required to deal with page crossing.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
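The three steps listed in this message can be sketched on a plain byte buffer. This is only an illustration of the write ordering: the breakpoint handler that diverts other CPUs (the "instruction bypass") and any cross-CPU synchronization between the steps are omitted, and the function name is hypothetical. The sketch writes the first byte of the new instruction in step 3; for an immediate-value update, where only the operand changes, that is the same as the original first byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INT3_OPCODE 0xcc	/* x86 one-byte breakpoint instruction */

/*
 * Replace the instruction at `text` with `new_insn` (both `len` bytes)
 * so that the first byte is a breakpoint for as long as the remaining
 * bytes are in an inconsistent state.
 */
void sketch_poke_with_breakpoint(uint8_t *text, const uint8_t *new_insn,
				 size_t len)
{
	if (len == 0)
		return;

	/* 1) Cover the instruction with a breakpoint. */
	text[0] = INT3_OPCODE;

	/* 2) Modify the rest of the instruction, non-atomically. */
	memcpy(text + 1, new_insn + 1, len - 1);

	/* 3) Put the first byte back, which re-arms the instruction. */
	text[0] = new_insn[0];
}
```

For instance, changing `mov $1, %eax` (b8 01 00 00 00) into `mov $0x12345678, %eax` (b8 78 56 34 12) rewrites four operand bytes in step 2 while the opcode byte is hidden behind the breakpoint.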
* Re: [PATCH 1/1] x86: fix text_poke
From: Ingo Molnar @ 2008-04-25 15:59 UTC
To: Mathieu Desnoyers
Cc: Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Woooow, just a sec here. I removed the atomicity test _because_ there
> happens to be a case where it's safe to do non-atomic instruction
> modification. If we do:
>
> 1) replace the instruction first byte by a breakpoint, execute an
>    instruction bypass (see the immediate values patches for detail)
> 2) modify the instruction non-atomically
> 3) put back the original instruction first byte.
>
> That's why I removed the BUG_ONs at the beginning of the function.
> That's also why it's required to deal with page crossing.

but the code as-is is nonsensical. It checks for:

	BUG_ON(len > sizeof(long));

but then deals with page crossing...

it should also rename text_poke_early() to text_poke_core(), and call
_that_ from text_poke() if core_kernel_text(). From that alone the whole
text_poke() function would look a whole lot cleaner.

	Ingo
* Re: [PATCH 1/1] x86: fix text_poke
From: Mathieu Desnoyers @ 2008-04-25 16:11 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

* Ingo Molnar (mingo@elte.hu) wrote:
>
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > Woooow, just a sec here. I removed the atomicity test _because_ there
> > happens to be a case where it's safe to do non-atomic instruction
> > modification. If we do:
> >
> > 1) replace the instruction first byte by a breakpoint, execute an
> >    instruction bypass (see the immediate values patches for detail)
> > 2) modify the instruction non-atomically
> > 3) put back the original instruction first byte.
> >
> > That's why I removed the BUG_ONs at the beginning of the function.
> > That's also why it's required to deal with page crossing.
>
> but the code as-is is nonsensical. It checks for:
>
> 	BUG_ON(len > sizeof(long));
>
> but then deals with page crossing...
>

That was in the initial version, before my patch, yes. I dealt with page
crossing at first, then added a more restrictive test to "play safe" (I
should have removed the page-crossing code at that point), but later on
noticed that there was a single case where it's valid to do non-atomic
updates, and it's when the execution flow is bypassed by a breakpoint
(as the immediate values are doing), so the last patch you have removes
the restrictive test and leaves the page-crossing code in place.

> it should also rename text_poke_early() to text_poke_core(), and call
> _that_ from text_poke() if core_kernel_text(). From that alone the whole
> text_poke() function would look a whole lot cleaner.
>

hrm, I am not convinced it's safe to call vmap() very early at boot
time. In the immediate values implementation, I do call text_poke very
very early at boot to populate the initial values. Or maybe you are
proposing something different from what I currently understand?

Mathieu

> 	Ingo

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [PATCH 1/1] x86: fix text_poke
From: Andi Kleen @ 2008-04-25 15:27 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Jiri Slaby, David Miller, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, Mathieu Desnoyers, Andi Kleen, pageexec, H. Peter Anvin, Jeremy Fitzhardinge

> i'm wondering what the best sanity checking would be. What we want is to

If you enable VIRTUAL_BUG_ON on x86-64 in mmzone_64.h it would have caught
it on a NUMA kernel I think.

-Andi
* Re: [PATCH 1/1] x86: fix text_poke
From: David Miller @ 2008-04-25 20:18 UTC
To: torvalds
Cc: jirislaby, zdenek.kabelac, rjw, paulmck, akpm, linux-ext4, herbert, penberg, clameter, linux-kernel, mathieu.desnoyers, andi, pageexec, hpa, jeremy, mingo

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 25 Apr 2008 08:03:27 -0700 (PDT)

> On Mon, 28 Apr 2008, Jiri Slaby wrote:
> >
> > Thanks. Bisected mm down to git-x86.patch, bisected git-x86-latest down to
> > x86: enhance DEBUG_RODATA support - alternatives
> > The patch below fixes the problem for me. Comments welcome.
>
> You're a hero, Jiri.

Indeed, what a heroic effort to fix a bug, thanks Jiri!!
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
From: David Miller @ 2008-04-25 1:35 UTC
To: jirislaby
Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter

From: Jiri Slaby <jirislaby@gmail.com>
Date: Fri, 25 Apr 2008 00:26:18 +0200

> ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0

0xf0...

Is this a 4-cpu machine? I doubt it, because this is on a laptop
as far as I can tell, but I thought I'd ask. :-)

So the clue is setting some byte at ((offset % 8) == 0) into a
structure with 0xf0...
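The observation about the dump can be reproduced mechanically. The sketch below scans a buffer that is expected to hold a repeating 64-bit fill pattern and reports the first byte that differs, using the little-endian byte order of x86-64. It assumes the four words quoted above start at offset 0 of the poisoned area and that ff00aa00deadcc22 is the fill pattern; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the byte offset of the first byte that differs from the fill
 * pattern, or -1 if every word matches. Offsets follow x86-64 memory
 * layout: the least significant byte of each word has the lowest
 * address. If `found` is non-NULL it receives the unexpected byte.
 */
long find_corrupt_byte(const uint64_t *words, size_t nwords,
		       uint64_t pattern, uint8_t *found)
{
	for (size_t w = 0; w < nwords; w++) {
		uint64_t diff = words[w] ^ pattern;

		if (!diff)
			continue;
		for (unsigned b = 0; b < 8; b++) {
			if ((diff >> (8 * b)) & 0xff) {
				if (found)
					*found = (uint8_t)(words[w] >> (8 * b));
				return (long)(w * 8 + b);
			}
		}
	}
	return -1;
}
```

For the quoted dump this reports byte 0xf0 at offset 24, and 24 % 8 == 0, which matches the remark about the 8-byte-aligned position in this particular sample.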
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
From: Linus Torvalds @ 2008-04-25 1:48 UTC
To: David Miller
Cc: jirislaby, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter

On Thu, 24 Apr 2008, David Miller wrote:
>
> So the clue is setting some byte at ((offset % 8) == 0) into a
> structure with 0xf0...

It's not always at (offset % 8) == 0. We've seen that 0xf0 pattern in
other oopses, but it's not always 8-byte aligned. In fact, when we've seen
it in oopses, it has generally been in the higher bytes (eg offset 5
within a 64-bit word, causing an invalid pointer on x86-64).

But that 0xf0 definitely has shown up before. It's not the *only*
corruption, but it's definitely a very interesting pattern. And the other
ones that didn't show the 0xf0 pattern could obviously be due to pointers
that were corrupted by 0xf0 in low bytes, so it _may_ be the source of the
other corruptions too that didn't have an obvious 0xf0 directly in them.

		Linus
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
From: David Miller @ 2008-04-25 1:57 UTC
To: torvalds
Cc: jirislaby, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 24 Apr 2008 18:48:32 -0700 (PDT)

> But that 0xf0 definitely has shown up before. It's not the *only*
> corruption, but it's definitely a very interesting pattern. And the other
> ones that didn't show the 0xf0 pattern could obviously be due to pointers
> that were corrupted by 0xf0 in low bytes, so it _may_ be the source of the
> other corruptions too that didn't have an obvious 0xf0 directly in them.

Ok.

Do we know of any pattern of the wireless device type in use?
If there is a pattern to that, it would be a huge clue.

And if it is predominantly one particular wireless device type, we
should be able to come up with a patch to test.
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
From: Jiri Slaby @ 2008-04-25 7:41 UTC
To: David Miller
Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter, Johannes Berg, Michael Wu, Jiri Benc

Added 3 80211 experts.

On 04/25/2008 03:57 AM, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu, 24 Apr 2008 18:48:32 -0700 (PDT)
>
>> But that 0xf0 definitely has shown up before. It's not the *only*
>> corruption, but it's definitely a very interesting pattern. And the other
>> ones that didn't show the 0xf0 pattern could obviously be due to pointers
>> that were corrupted by 0xf0 in low bytes, so it _may_ be the source of the
>> other corruptions too that didn't have an obvious 0xf0 directly in them.
>
> Ok.
>
> Do we know of any pattern of the wireless device type in use?
> If there is a pattern to that, it would be a huge clue.
>
> And if it is predominantly one particular wireless device type, we
> should be able to come up with a patch to test.

Johannes, Michael, Jiri? Someone is writing the byte pattern 0xf0 into
freed memory (not aligned to anything, addressed per byte). One of the
suspects is mac80211; do you recognize that pattern from anywhere?

Thanks, Jiri.
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:41 ` Jiri Slaby @ 2008-04-25 7:45 ` David Miller 2008-04-25 8:02 ` Jiri Slaby ` (2 more replies) 0 siblings, 3 replies; 183+ messages in thread From: David Miller @ 2008-04-25 7:45 UTC (permalink / raw) To: jirislaby Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter, johannes, flamingice, jbenc From: Jiri Slaby <jirislaby@gmail.com> Date: Fri, 25 Apr 2008 09:41:12 +0200 > Added 3 80211 experts. > > On 04/25/2008 03:57 AM, David Miller wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > > Date: Thu, 24 Apr 2008 18:48:32 -0700 (PDT) > > > >> But that 0xf0 definitely has shown up before. It's not the *only* > >> corruption, but it's definitely a very interesting pattern. And the other > >> ones that didn't show the 0xf0 pattern could obviously be due to pointers > >> that were corrupted by 0xf0 in low bytes, so it _may_ be the source of the > >> other corruptions too that didn't have an obvious 0xf0 directly in them. > > > > Ok. > > > > Do we know of any pattern of the wireless device type in use? > > If there is a pattern to that, it would be a huge clue. > > > > And if it is predominantly one particular wireless device type, we > > should be able to come up with a patch to test. > > Johannes, Michael, Jiri? Someone writes to freed memory patterns 0xf0 (not > aligned to anything, addressed per byte), one of suspects is mac80211, don't you > know that pattern from anywhere? I notice Jiri, in your hardware list, you have an ath5k Atheros AR5212 chip in there. I took a look at the resume code for ath5k but nothing really suspicious there except: err = pci_enable_device(pdev); if (err) return err; pci_restore_state(pdev); Shouldn't we restore state before we turn the chip back on and thus potentially let it start DMA'ing all over the place? ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:45 ` David Miller @ 2008-04-25 8:02 ` Jiri Slaby 2008-04-25 8:18 ` pci commands resume order [Was: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff] Jiri Slaby 2008-04-25 10:53 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Craig Schlenter 2 siblings, 0 replies; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 8:02 UTC (permalink / raw) To: David Miller Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter, johannes, flamingice, jbenc On 04/25/2008 09:45 AM, David Miller wrote: > I notice Jiri, in your hardware list, you have an ath5k Atheros AR5212 chip > in there. > > I took a look at the resume code for ath5k but nothing really suspicious > there except: > > err = pci_enable_device(pdev); > if (err) > return err; > > pci_restore_state(pdev); > > Shouldn't we restore state before we turn the chip back on and thus > potentially let it start DMA'ing all over the place? Hmm, I cut&pasted that code from somewhere; it seems to be broken. Anyway, it worked for half a year or so. Mending locally. Thanks. ^ permalink raw reply [flat|nested] 183+ messages in thread
* pci commands resume order [Was: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff] 2008-04-25 7:45 ` David Miller 2008-04-25 8:02 ` Jiri Slaby @ 2008-04-25 8:18 ` Jiri Slaby 2008-04-25 17:11 ` Jesse Barnes 2008-04-25 10:53 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Craig Schlenter 2 siblings, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 8:18 UTC (permalink / raw) To: David Miller; +Cc: linux-pci, linux-kernel, Jesse Barnes On 04/25/2008 09:45 AM, David Miller wrote: > I notice Jiri, in your hardware list, you have an ath5k Atheros AR5212 chip > in there. > > I took a look at the resume code for ath5k but nothing really suspicious > there except: > > err = pci_enable_device(pdev); > if (err) > return err; > > pci_restore_state(pdev); > > Shouldn't we restore state before we turn the chip back on and thus > potentially let it start DMA'ing all over the place? Hmm, actually every second wireless driver does this :/. I think it's wrong too. Jesse? BTW pci_set_power_state(pdev, PCI_D0); in resume isn't needed at all, right? It's done in pci_enable_device, isn't it? ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: pci commands resume order [Was: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff] 2008-04-25 8:18 ` pci commands resume order [Was: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff] Jiri Slaby @ 2008-04-25 17:11 ` Jesse Barnes 0 siblings, 0 replies; 183+ messages in thread From: Jesse Barnes @ 2008-04-25 17:11 UTC (permalink / raw) To: linux-pci; +Cc: Jiri Slaby, David Miller, linux-kernel On Friday, April 25, 2008 1:18 am Jiri Slaby wrote: > On 04/25/2008 09:45 AM, David Miller wrote: > > I notice Jiri, in your hardware list, you have an ath5k Atheros AR5212 > > chip in there. > > > > I took a look at the resume code for ath5k but nothing really suspicious > > there except: > > > > err = pci_enable_device(pdev); > > if (err) > > return err; > > > > pci_restore_state(pdev); > > > > Shouldn't we restore state before we turn the chip back on and thus > > potentially let it start DMA'ing all over the place? > > Hmm, actually every second wireless driver do this :/. I think it's wrong > too. Jesse? It might be a little safer to enable the device after restoring its state, but if your device starts DMAing randomly when going from D3->D0 it's probably not a very good device. :) > BTW pci_set_power_state(pdev, PCI_D0); in resume isn't needed at all, > right? It's done in pci_enable_device, isn't it? Right, pci_enable_device will put the device in D0, so setting it again is redundant (looks like lots of drivers do this). Jesse ^ permalink raw reply [flat|nested] 183+ messages in thread
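The ordering question in this sub-thread can be illustrated with a small user-space mock in C. This is only a sketch under stated assumptions: struct mock_pci_dev and the mock_* functions are hypothetical stand-ins and not the kernel PCI API, and the mock simply assumes (as suggested above) that enabling a device whose config state has not yet been restored is the situation to avoid.

```c
/* Hypothetical stand-in for a PCI device; this is NOT struct pci_dev. */
struct mock_pci_dev {
    int state_restored;         /* saved config space has been written back */
    int enabled;                /* device has been enabled */
    int enabled_before_restore; /* set if enable ran with stale config state */
};

/* Stand-in for restoring the saved config space. */
static void mock_restore_state(struct mock_pci_dev *dev)
{
    dev->state_restored = 1;
}

/* Stand-in for enabling the device; records whether state was still stale. */
static int mock_enable_device(struct mock_pci_dev *dev)
{
    if (!dev->state_restored)
        dev->enabled_before_restore = 1;
    dev->enabled = 1;
    return 0;
}

/* Order questioned in the thread: enable first, restore afterwards. */
static int resume_enable_first(struct mock_pci_dev *dev)
{
    int err = mock_enable_device(dev);

    if (err)
        return err;
    mock_restore_state(dev);
    return 0;
}

/* Order suggested in the thread: restore first, then enable. */
static int resume_restore_first(struct mock_pci_dev *dev)
{
    mock_restore_state(dev);
    return mock_enable_device(dev);
}
```

Running both variants on a zero-initialised mock device leaves enabled_before_restore set only for resume_enable_first. Whether that window matters on real hardware depends on the device, as noted in the reply above.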
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:45 ` David Miller 2008-04-25 8:02 ` Jiri Slaby 2008-04-25 8:18 ` pci commands resume order [Was: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff] Jiri Slaby @ 2008-04-25 10:53 ` Craig Schlenter 2 siblings, 0 replies; 183+ messages in thread From: Craig Schlenter @ 2008-04-25 10:53 UTC (permalink / raw) To: David Miller Cc: jirislaby, torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter, johannes, flamingice, jbenc On Fri, Apr 25, 2008 at 12:45:23AM -0700, David Miller wrote: [snip] > I notice Jiri, in your hardware list, you have an ath5k Atheros AR5212 chip > in there. I'm not sure how much code is shared between AR5212 and AR2413 but I saw kmalloc poison issues about a month back with a fedora rawhide kernel on a machine with a AR2413 chipset ... I filed the bug here: http://madwifi.org/ticket/1856 I don't use that machine much so I haven't tried other kernels but perhaps the above info helps in some way. bye, --Craig ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 1:35 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff David Miller 2008-04-25 1:48 ` Linus Torvalds @ 2008-04-25 7:42 ` Jiri Slaby 2008-04-25 7:49 ` David Miller 1 sibling, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 7:42 UTC (permalink / raw) To: David Miller Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter On 04/25/2008 03:35 AM, David Miller wrote: > From: Jiri Slaby <jirislaby@gmail.com> > Date: Fri, 25 Apr 2008 00:26:18 +0200 > >> ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0 > > 0xf0... Is this a 4-cpu machine? It's a 2-core machine. > I doubt it, because this is on a laptop as far as I can tell, > but I thought I'd ask. :-) Well, it's not :), it's a desktop. Jiri ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:42 ` Jiri Slaby @ 2008-04-25 7:49 ` David Miller 2008-04-25 7:56 ` Jiri Slaby 0 siblings, 1 reply; 183+ messages in thread From: David Miller @ 2008-04-25 7:49 UTC (permalink / raw) To: jirislaby Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter From: Jiri Slaby <jirislaby@gmail.com> Date: Fri, 25 Apr 2008 09:42:43 +0200 > On 04/25/2008 03:35 AM, David Miller wrote: > > From: Jiri Slaby <jirislaby@gmail.com> > > Date: Fri, 25 Apr 2008 00:26:18 +0200 > > > >> ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0 > > > > 0xf0... Is this a 4-cpu machine? > > It's a 2 cores machine. Two hyperthreads per-core? If so that could match up to the pattern. It is just one theory though. The wireless possibility holds just as much weight. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:49 ` David Miller @ 2008-04-25 7:56 ` Jiri Slaby 2008-04-25 7:58 ` David Miller 0 siblings, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 7:56 UTC (permalink / raw) To: David Miller Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter On 04/25/2008 09:49 AM, David Miller wrote: > From: Jiri Slaby <jirislaby@gmail.com> > Date: Fri, 25 Apr 2008 09:42:43 +0200 > >> On 04/25/2008 03:35 AM, David Miller wrote: >>> From: Jiri Slaby <jirislaby@gmail.com> >>> Date: Fri, 25 Apr 2008 00:26:18 +0200 >>> >>>> ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0 >>> 0xf0... Is this a 4-cpu machine? >> It's a 2-core machine. > > Two hyperthreads per-core? Hmm, how do I find out? I suppose it would show up as 4 (virtual) processors in cpuinfo, right? Although there is an ht bit in cpuinfo on each core and CONFIG_X86_HT=y, I don't see 4 cpus. > If so that could match up to the pattern. It is just one theory > though. The wireless possibility holds just as much weight. The mac80211 one is just a theory so far too :). ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:56 ` Jiri Slaby @ 2008-04-25 7:58 ` David Miller 2008-04-25 8:00 ` Jiri Slaby 0 siblings, 1 reply; 183+ messages in thread From: David Miller @ 2008-04-25 7:58 UTC (permalink / raw) To: jirislaby Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter From: Jiri Slaby <jirislaby@gmail.com> Date: Fri, 25 Apr 2008 09:56:27 +0200 > On 04/25/2008 09:49 AM, David Miller wrote: > > Two hyperthreads per-core? > > Hmm, how to find out? I suppose it will show up 4 (virtual) processors in > cpuinfo, right? Although there is ht bit in cpuinfo on each core and > CONFIG_X86_HT=y, I don't see 4 cpus. Ok, good to know. > > If so that could match up to the pattern. It is just one theory > > though. The wireless possibility holds just as much weight. > > The mac80211 is theory too so far :). I'll try to do some commit mining of my own. Thanks for all of the info so far. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 7:58 ` David Miller @ 2008-04-25 8:00 ` Jiri Slaby 2008-04-25 15:47 ` Randy Dunlap 0 siblings, 1 reply; 183+ messages in thread From: Jiri Slaby @ 2008-04-25 8:00 UTC (permalink / raw) To: David Miller Cc: torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter On 04/25/2008 09:58 AM, David Miller wrote: >>> If so that could match up to the pattern. It is just one theory >>> though. The wireless possibility holds just as much weight. >> The mac80211 is theory too so far :). > > I'll try to do some commit mining of my own. BTW, isn't there any tool to compare diffs? In particular, 2.6.25-rc8-mm1..2.6.25-rc8-mm2 against 2.6.25..2.6.25-git2... I would just give it a try. ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-25 8:00 ` Jiri Slaby @ 2008-04-25 15:47 ` Randy Dunlap 0 siblings, 0 replies; 183+ messages in thread From: Randy Dunlap @ 2008-04-25 15:47 UTC (permalink / raw) To: Jiri Slaby Cc: David Miller, torvalds, zdenek.kabelac, mingo, rjw, paulmck, linux-kernel, akpm, linux-ext4, herbert, penberg, clameter On Fri, 25 Apr 2008 10:00:05 +0200 Jiri Slaby wrote: > On 04/25/2008 09:58 AM, David Miller wrote: > >>> If so that could match up to the pattern. It is just one theory > >>> though. The wireless possibility holds just as much weight. > >> The mac80211 is theory too so far :). > > > > I'll try to do some commit mining of my own. > > BTW Doesn't exist any tool to compare diffs? Particularly > 2.6.25-rc8-mm1..2.6.25-rc8-mm2 with 2.6.25..2.6.25-git2... I would just give it > a try. 'interdiff', part of the patchutils package: http://cyberelk.net/tim/software/patchutils/ --- ~Randy ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 18:35 ` Zdenek Kabelac 2008-04-22 18:48 ` Linus Torvalds @ 2008-04-22 21:46 ` Rafael J. Wysocki 1 sibling, 0 replies; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-22 21:46 UTC (permalink / raw) To: Zdenek Kabelac Cc: Ingo Molnar, Jiri Slaby, Linus Torvalds, paulmck, David Miller, linux-kernel, akpm, linux-ext4, herbert On Tuesday, 22 of April 2008, Zdenek Kabelac wrote: > 2008/4/22, Ingo Molnar <mingo@elte.hu>: > > > > * Jiri Slaby <jirislaby@gmail.com> wrote: > > > > >> What do you do to trigger this? Any particular load? Is it still just > > >> doing suspend/resume, or do you have something else that you are > > >> playing with? > > > > > > Yesterday I did 2 suspend/resumes after 1 hour of uptime and ran > > > git-status for a fraction of a second until it was killed. So I can > > > perfectly reproduce it when I suspend, resume and produce some io > > > load. I guess it's time to bisect 2.6.25-rc8-mm2 as I'm able to > > > reproduce it the best and haven't seen that bug in -rc8-mm1 for over > > > week of suspending and working. > > > > > > the most dangerous x86 change we added was the PAT stuff. Does it > > influence the crashes in any way if you boot with 'nopat' or if you > > disable CONFIG_X86_PAT=y into the .config? > > > > the other area was the DMA ops change - that should be rather trivial on > > 64-bit though. > > > Unsure how it is related to my original Oops post - but now that I have > debug pagealloc enabled this appeared in my log after resume - should > I open a new bug for this - or could this be part of the problem I've > experienced later?
> > (Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2 > > sd 0:0:0:0: [sda] Starting disk > mmc0: new SD card at address 5a61 > mmc mmc0:5a61: parent mmc0 is sleeping, will not add > ------------[ cut here ]------------ > WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0() > Modules linked in: tda18271 nls_iso8859_2 nls_cp852 vfat fat i915 drm > ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state > nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables > bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc > binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm > snd_hda_intel snd_seq_oss snd_seq_midi_event snd_seq arc4 > snd_seq_device snd_pcm_oss ecb crypto_blkcipher cryptomgr > crypto_algapi iwl3945 snd_mixer_oss mac80211 snd_pcm mmc_block video > sdhci thinkpad_acpi mmc_core i2c_i801 snd_timer rtc_cmos rtc_core > backlight iTCO_wdt cfg80211 evdev snd i2c_core e1000e psmouse > soundcore snd_page_alloc nvram intel_agp rtc_lib iTCO_vendor_support > output serio_raw ac battery button uhci_hcd ohci_hcd ehci_hcd usbcore > [last unloaded: microcode] > Pid: 1240, comm: kmmcd Not tainted 2.6.25 #57 > > Call Trace: > [warn_on_slowpath+95/144] warn_on_slowpath+0x5f/0x90 > [device_pm_add+24/240] ? device_pm_add+0x18/0xf0 > [device_pm_add+108/240] device_pm_add+0x6c/0xf0 > [device_add+1092/1376] device_add+0x444/0x560 > [_end+510110570/2109230024] :mmc_core:mmc_add_card+0xa2/0x140 > [_end+510117927/2109230024] :mmc_core:mmc_attach_sd+0x17f/0x860 > [_end+510109176/2109230024] ? :mmc_core:mmc_rescan+0x0/0x1c0 > [_end+510109545/2109230024] :mmc_core:mmc_rescan+0x171/0x1c0 > [run_workqueue+246/560] run_workqueue+0xf6/0x230 > [worker_thread+167/288] worker_thread+0xa7/0x120 > [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > [worker_thread+0/288] ? worker_thread+0x0/0x120 > [kthread+73/144] kthread+0x49/0x90 > [child_rip+10/18] child_rip+0xa/0x12 > [restore_args+0/48] ? 
restore_args+0x0/0x30 > [kthread+0/144] ? kthread+0x0/0x90 > [child_rip+0/18] ? child_rip+0x0/0x12 > > ---[ end trace ca143223eefdc828 ]--- > BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 > IP: [klist_del+29/128] klist_del+0x1d/0x80 > PGD 0 > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC > CPU 0 > Modules linked in: tda18271 nls_iso8859_2 nls_cp852 vfat fat i915 drm > ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state > nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables > bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc > binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm > snd_hda_intel snd_seq_oss snd_seq_midi_event snd_seq arc4 > snd_seq_device snd_pcm_oss ecb crypto_blkcipher cryptomgr > crypto_algapi iwl3945 snd_mixer_oss mac80211 snd_pcm mmc_block video > sdhci thinkpad_acpi mmc_core i2c_i801 snd_timer rtc_cmos rtc_core > backlight iTCO_wdt cfg80211 evdev snd i2c_core e1000e psmouse > soundcore snd_page_alloc nvram intel_agp rtc_lib iTCO_vendor_support > output serio_raw ac battery button uhci_hcd ohci_hcd ehci_hcd usbcore > [last unloaded: microcode] > Pid: 1240, comm: kmmcd Not tainted 2.6.25 #57 > RIP: 0010:[klist_del+29/128] [klist_del+29/128] klist_del+0x1d/0x80 > RSP: 0000:ffff81007cabbd00 EFLAGS: 00010286 > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003 > RDX: 0000000000000008 RSI: ffffffffa0102308 RDI: 0000000000000000 > RBP: ffff81007cabbd20 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000001 R11: ffff81007c9a6d10 R12: ffff81007c517530 > R13: ffffffffa0102260 R14: ffff81007cabbdf0 R15: ffff81007c5175a8 > FS: 0000000000000000(0000) GS:ffffffff8148c000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000000000000050 CR3: 0000000001001000 CR4: 00000000000026e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
> Process kmmcd (pid: 1240, threadinfo ffff81007caba000, task ffff81007cac0000) > Stack: ffff81007cabbd10 0000000000000050 ffff81007c5173f8 ffffffffa0102260 > ffff81007cabbd50 ffffffff812012fe ffff81007cabbd50 ffff81007c5173f8 > 00000000fffffff0 ffff81007c5175f0 ffff81007cabbdb0 ffffffff8120016e > Call Trace: > [bus_remove_device+158/208] bus_remove_device+0x9e/0xd0 > [device_add+1358/1376] device_add+0x54e/0x560 > [_end+510110570/2109230024] :mmc_core:mmc_add_card+0xa2/0x140 > hald[2531]: forcibly attempting to lazy unmount /dev/mmcblk0p1 as > enclosing drive was disconnected > [_end+510117927/2109230024] :mmc_core:mmc_attach_sd+0x17f/0x860 > [_end+510109176/2109230024] ? :mmc_core:mmc_rescan+0x0/0x1c0 > [_end+510109545/2109230024] :mmc_core:mmc_rescan+0x171/0x1c0 > [run_workqueue+246/560] run_workqueue+0xf6/0x230 > [worker_thread+167/288] worker_thread+0xa7/0x120 > [autoremove_wake_function+0/64] ? autoremove_wake_function+0x0/0x40 > [worker_thread+0/288] ? worker_thread+0x0/0x120 > [kthread+73/144] kthread+0x49/0x90 > [child_rip+10/18] child_rip+0xa/0x12 > [restore_args+0/48] ? restore_args+0x0/0x30 > [kthread+0/144] ? kthread+0x0/0x90 > [child_rip+0/18] ? child_rip+0x0/0x12 > > > Code: 8b 28 41 0f 95 c7 eb 87 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec > 20 4c 89 65 f0 48 89 5d e8 4c 89 6d f8 49 89 fc 48 8b 1f 48 89 df <4c> > 8b 6b 50 e8 9a 40 01 00 49 8d 7c 24 18 48 c7 c6 20 a4 2d 81 > RIP [klist_del+29/128] klist_del+0x1d/0x80 > RSP <ffff81007cabbd00> > CR2: 0000000000000050 > ---[ end trace ca143223eefdc828 ]--- Zdenek, can you please send me the full dmesg containing this? Thanks, Rafael ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 9:53 ` Ingo Molnar 2008-04-22 18:35 ` Zdenek Kabelac @ 2008-04-22 19:09 ` Ingo Molnar 1 sibling, 0 replies; 183+ messages in thread From: Ingo Molnar @ 2008-04-22 19:09 UTC (permalink / raw) To: Jiri Slaby Cc: Linus Torvalds, Rafael J. Wysocki, paulmck, David Miller, linux-kernel, akpm, linux-ext4, herbert, Zdenek Kabelac, H. Peter Anvin * Ingo Molnar <mingo@elte.hu> wrote: > > Yesterday I did 2 suspend/resumes after 1 hour of uptime and ran > > git-status for a fraction of a second until it was killed. So I can > > perfectly reproduce it when I suspend, resume and produce some io > > load. I guess it's time to bisect 2.6.25-rc8-mm2 as I'm able to > > reproduce it the best and haven't seen that bug in -rc8-mm1 for over > > week of suspending and working. > > the most dangerous x86 change we added was the PAT stuff. Does it > influence the crashes in any way if you boot with 'nopat' or if you > disable CONFIG_X86_PAT=y into the .config? note that full PAT (where in essence Linux takes over control of the cache attributes via PTEs, instead of relying on the BIOS initialized MTRRs alone) you should only get with -mm or with x86.git applied. I.e. x86 PAT might explain any -mm issue but not the upstream -git issue. In upstream -git we dont have the second wave of the PAT changes applied yet (the /dev/mem bits) so CONFIG_X86_PAT is not yet activated. (it's only safe to enable if we have all the changes together and perfectly control all cache attributes in the system) i.e. PAT complications here would not happen in form of real cache attribute conflicts [i.e. the lockups and corruptions cannot be due to that] - but as side-effects to other code it changes. and most of the PAT failures we ever saw had different patterns anyway: the leading failure was API rejections and hence non-working Xorg or non-working ioremap() in certain drivers. 
The worst-case scenario, early in the PAT code's cycle, was a spontaneous triple fault - months ago. the basis for the PAT changes was the hardening of the CPA code and its general use for everything (such as DEBUG_PAGEALLOC). And much of that happened and was finished in v2.6.25. Nothing conceptually new really happened there - and even where we touched the code in .26 it happened long ago and would have surfaced by now. ... but ... nothing can be excluded. Ingo ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 22:54 ` Paul E. McKenney 2008-04-21 23:02 ` Jiri Slaby @ 2008-04-22 1:15 ` Rafael J. Wysocki 2008-04-22 1:25 ` [ProbableSpam]Re: " Paul E. McKenney 1 sibling, 1 reply; 183+ messages in thread From: Rafael J. Wysocki @ 2008-04-22 1:15 UTC (permalink / raw) To: paulmck Cc: Jiri Slaby, David Miller, torvalds, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On Tuesday, 22 of April 2008, Paul E. McKenney wrote: > On Tue, Apr 22, 2008 at 12:26:04AM +0200, Jiri Slaby wrote: > > On 04/21/2008 11:58 PM, Jiri Slaby wrote: > > >Leaving untouched. > > > > > >On 04/21/2008 11:18 PM, Jiri Slaby wrote: > > >>On 04/21/2008 10:39 PM, David Miller wrote: > > >>>From: Linus Torvalds <torvalds@linux-foundation.org> > > >>>Date: Mon, 21 Apr 2008 09:54:07 -0700 (PDT) > > >>> > > >>>>What I find interesting is that at least for me, I have the SLAB > > >>>>bucket size for nf_conntrack_expect being 208 bytes. And the > > >>>>*biggest* merge by far after 2.6.25 so far has been networking (and > > >>>>conntrack in particular) > > >>>> > > >>>>Is that a smoking gun? Not necessarily. But it *is* intriguing. But > > >>>>there are other possible clashes (the 192-byte bucket has several > > >>>>different suspects, and not all of them are in networking).1 > > >>> > > >>>I think you might be onto something here. > > >>> > > >>>The "mask" member of struct nf_conntrack_expect could be reasonably > > >>>all 1's like the value reported in the crash that begins this > > >>>thread. > > >>> > > >>>Do we know the offset within the object at which this all 1's > > >>>value is found? > > >>> > > >>>My rough calculations show that on 32-bit that expect->mask member is > > >>>at offset 56 and on 64-bit it should be at offset 72. Does that > > >>>match up to the offset of the filp or whatever bit being corrupted? 
> > >> > > >>dentry.d_name.name is 56 on 64-bit (my memcmp crashes) > > >>dentry.d_hash.next is 24 (crashed at least 3 times here, rafael's one) > > >>dentry.d_op is 136 (crash below) > > > > > >file.f_mapping is 176 (the another one from -rc8-mm2) > > > > > >the one at: > > >http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/9008289.html > > > > > > > > >Having slub_debug enabled, tomorrow will be results, I guess... > > > > Sorry, one more entry: > > > > 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) > > 00f0000000000000 dentry.d_hash.next (me, offset 24) > > ffff81f02003f16c dentry.d_name.name (me, offset 56) > > memory ORed by 000000f000000000 > > fffff0002004c1b0 file.f_mapping (me, offset 176) > > memory hole, it was something like > > (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? > > ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) > > -1, ~0ULL > > Are these running with CONFIG_PREEMPT_RCU? Grasping at straws, but > there are a couple of patches that need to move from -rt to mainline, > but mostly related to SELinux. So if both PREEMPT_RCU and SELinux > were in use, we might be missing "rcu-various-fixups.patch" from: > > http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.24.4-rt4-broken-out.tar.bz2 My kernel is only voluntarily preemptible (ie. CONFIG_PREEMPT_VOLUNTARY=y). It is an SMP one, however. Thanks, Rafael ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: [ProbableSpam]Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-22 1:15 ` Rafael J. Wysocki @ 2008-04-22 1:25 ` Paul E. McKenney 0 siblings, 0 replies; 183+ messages in thread From: Paul E. McKenney @ 2008-04-22 1:25 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Jiri Slaby, David Miller, torvalds, linux-kernel, mingo, akpm, linux-ext4, herbert, Zdenek Kabelac On Tue, Apr 22, 2008 at 03:15:00AM +0200, Rafael J. Wysocki wrote: > On Tuesday, 22 of April 2008, Paul E. McKenney wrote: > > On Tue, Apr 22, 2008 at 12:26:04AM +0200, Jiri Slaby wrote: > > > On 04/21/2008 11:58 PM, Jiri Slaby wrote: > > > >Leaving untouched. > > > > > > > >On 04/21/2008 11:18 PM, Jiri Slaby wrote: > > > >>On 04/21/2008 10:39 PM, David Miller wrote: > > > >>>From: Linus Torvalds <torvalds@linux-foundation.org> > > > >>>Date: Mon, 21 Apr 2008 09:54:07 -0700 (PDT) > > > >>> > > > >>>>What I find interesting is that at least for me, I have the SLAB > > > >>>>bucket size for nf_conntrack_expect being 208 bytes. And the > > > >>>>*biggest* merge by far after 2.6.25 so far has been networking (and > > > >>>>conntrack in particular) > > > >>>> > > > >>>>Is that a smoking gun? Not necessarily. But it *is* intriguing. But > > > >>>>there are other possible clashes (the 192-byte bucket has several > > > >>>>different suspects, and not all of them are in networking).1 > > > >>> > > > >>>I think you might be onto something here. > > > >>> > > > >>>The "mask" member of struct nf_conntrack_expect could be reasonably > > > >>>all 1's like the value reported in the crash that begins this > > > >>>thread. > > > >>> > > > >>>Do we know the offset within the object at which this all 1's > > > >>>value is found? > > > >>> > > > >>>My rough calculations show that on 32-bit that expect->mask member is > > > >>>at offset 56 and on 64-bit it should be at offset 72. Does that > > > >>>match up to the offset of the filp or whatever bit being corrupted? 
> > > >> > > > >>dentry.d_name.name is 56 on 64-bit (my memcmp crashes) > > > >>dentry.d_hash.next is 24 (crashed at least 3 times here, rafael's one) > > > >>dentry.d_op is 136 (crash below) > > > > > > > >file.f_mapping is 176 (the another one from -rc8-mm2) > > > > > > > >the one at: > > > >http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/9008289.html > > > > > > > > > > > >Having slub_debug enabled, tomorrow will be results, I guess... > > > > > > Sorry, one more entry: > > > > > > 00000000000000f0 dentry.d_op (Zdenek, offset ? around 136) > > > 00f0000000000000 dentry.d_hash.next (me, offset 24) > > > ffff81f02003f16c dentry.d_name.name (me, offset 56) > > > memory ORed by 000000f000000000 > > > fffff0002004c1b0 file.f_mapping (me, offset 176) > > > memory hole, it was something like > > > (ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000? > > > ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24) > > > -1, ~0ULL > > > > Are these running with CONFIG_PREEMPT_RCU? Grasping at straws, but > > there are a couple of patches that need to move from -rt to mainline, > > but mostly related to SELinux. So if both PREEMPT_RCU and SELinux > > were in use, we might be missing "rcu-various-fixups.patch" from: > > > > http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.24.4-rt4-broken-out.tar.bz2 > > My kernel is only voluntarily preemptible (ie. CONFIG_PREEMPT_VOLUNTARY=y). > > It is an SMP one, however. Then this patch won't help you. :-/ I submitted separately anyway. Thanx, Paul ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 20:39 ` David Miller 2008-04-21 21:18 ` Jiri Slaby @ 2008-04-21 21:19 ` Linus Torvalds 2008-04-21 21:54 ` David Miller 1 sibling, 1 reply; 183+ messages in thread From: Linus Torvalds @ 2008-04-21 21:19 UTC (permalink / raw) To: David Miller Cc: rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, paulmck, jirislaby On Mon, 21 Apr 2008, David Miller wrote: > > Do we know the offset within the object at which this all 1's > value is found? > > My rough calculations show that on 32-bit that expect->mask member is > at offset 56 and on 64-bit it should be at offset 72. Does that > match up to the offset of the filp or whatever bit being corrupted? No, I think that the d_hash list is at offset 24 (64-bit). But that changes if any of - GENERIC_LOCKBREAK - DEBUG_SPINLOCK - DEBUG_LOCK_ALLOC (and if so, LOCK_STAT) is set, and then you might actually get to 72. However, the Code: line for one of the oopses shows that in that particular case, it was at offset 0x18 (ie the normal 24), so at least one of the oopses had no such thing going on. Linus ^ permalink raw reply [flat|nested] 183+ messages in thread
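The offsets being discussed (d_hash at 24, d_name.name at 56, and the shift caused by lock debugging options) can be reproduced with a user-space mock. The sketch below is a simplified recollection of the leading fields of the 2.6.25-era struct dentry and not the real definition: all mock_* names are hypothetical, pointers are modelled as uint64_t to mimic a 64-bit layout, and the 24-byte "debug" lock is an arbitrary illustrative size rather than the real size under any particular config.

```c
#include <stddef.h>
#include <stdint.h>

/* Pointers are modelled as uint64_t to mimic a 64-bit kernel layout. */
struct mock_hlist_node {
    uint64_t next;
    uint64_t pprev;
};

struct mock_qstr {
    uint32_t hash;
    uint32_t len;
    uint64_t name;
};

/* Leading fields of a dentry with a plain 4-byte spinlock (no lock debugging). */
struct mock_dentry {
    uint32_t d_count;
    uint32_t d_flags;
    uint32_t d_lock;
    uint64_t d_inode;
    struct mock_hlist_node d_hash;
    uint64_t d_parent;
    struct mock_qstr d_name;
};

/* Same fields with a hypothetical, larger debug spinlock (24 bytes here). */
struct mock_dentry_dbg {
    uint32_t d_count;
    uint32_t d_flags;
    uint64_t d_lock[3];
    uint64_t d_inode;
    struct mock_hlist_node d_hash;
    uint64_t d_parent;
    struct mock_qstr d_name;
};

/* container_of-style step: hash-node address back to the enclosing object. */
static uintptr_t dentry_from_hash(uintptr_t node_addr)
{
    return node_addr - offsetof(struct mock_dentry, d_hash);
}
```

On x86_64, offsetof(struct mock_dentry, d_hash) comes out as 24 (0x18) and the name pointer lands at 56, matching the numbers quoted in the thread, while the larger lock in mock_dentry_dbg pushes d_hash to 40. The 48 8d 5a e8 bytes in the Code: line quoted elsewhere in this thread appear to decode to lea -0x18(%rdx),%rbx, which would be the same subtraction that dentry_from_hash performs.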
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff 2008-04-21 21:19 ` Linus Torvalds @ 2008-04-21 21:54 ` David Miller 0 siblings, 0 replies; 183+ messages in thread From: David Miller @ 2008-04-21 21:54 UTC (permalink / raw) To: torvalds Cc: rjw, linux-kernel, mingo, akpm, linux-ext4, herbert, paulmck, jirislaby From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 21 Apr 2008 14:19:26 -0700 (PDT) > However, the Code: line for one of the oopses shows that in that > particular case, it was at offset 0x18 (ie the normal 24), so at least one > of the oopses had no such thing going on. On 64-bit x86_64, which I believe is the case you are referring to, that would be right in the middle of an hlist_node. We would expect to see a valid pointer, NULL, or a list poison value. Which we're not. But I don't think networking or even netfilter can in any way be ruled out yet. A lot of the speculation is because of the SLUB cache sharing between different object types. Is there some way to disable that and see how that influences the bug? Of course, even with sharing disabled a cache's page could get freed and then re-allocated into another cache, but the likelihood of it happening exactly to a filp or dentry cache from a netfilter or whatever one is extremely low :) ^ permalink raw reply [flat|nested] 183+ messages in thread
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-20 19:04 ` 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff Rafael J. Wysocki
  2008-04-20 19:14 ` Rafael J. Wysocki
  2008-04-20 21:31 ` Linus Torvalds
@ 2008-04-21 13:17 ` Ingo Molnar
  2008-04-21 13:35 ` Rafael J. Wysocki
  2 siblings, 1 reply; 183+ messages in thread
From: Ingo Molnar @ 2008-04-21 13:17 UTC
To: Rafael J. Wysocki; +Cc: LKML, Andrew Morton, Linus Torvalds, linux-ext4

* Rafael J. Wysocki <rjw@sisk.pl> wrote:

> Hi,
>
> I've just got the following traces from 2.6.25-git2 on HP nx6325
> (64-bit). I think they are related to the hang I described yesterday:

> [12844.112673] [<ffffffff8029e236>] do_lookup+0x2c/0x1b2
> [12844.112683] [<ffffffff802a04b4>] __link_path_walk+0x8e6/0xdbd
> [12844.112707] [<ffffffffa004deb4>] ? :ext3:ext3_xattr_get_acl_default+0x18/0x1a
> [12844.112714] [<ffffffff802b0869>] ? generic_getxattr+0x4e/0x5c

so you've got ext3. Nothing changed in the VFS or in ext3 in -git yet.

the instruction pattern:

Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b
45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b
00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff
========

shows that you've got "prefetchnta (%esi)" indirect:

  0f 18 00 prefetcht0 (%eax)

so the prefetch instructions are patched in, neither the compiler nor
the CPU should ignore them.

	Ingo
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 13:17 ` Ingo Molnar
@ 2008-04-21 13:35 ` Rafael J. Wysocki
  2008-04-21 18:56 ` Ingo Molnar
  0 siblings, 1 reply; 183+ messages in thread
From: Rafael J. Wysocki @ 2008-04-21 13:35 UTC
To: Ingo Molnar; +Cc: LKML, Andrew Morton, Linus Torvalds, linux-ext4

On Monday, 21 of April 2008, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> > Hi,
> >
> > I've just got the following traces from 2.6.25-git2 on HP nx6325
> > (64-bit). I think they are related to the hang I described yesterday:
>
> > [12844.112673] [<ffffffff8029e236>] do_lookup+0x2c/0x1b2
> > [12844.112683] [<ffffffff802a04b4>] __link_path_walk+0x8e6/0xdbd
> > [12844.112707] [<ffffffffa004deb4>] ? :ext3:ext3_xattr_get_acl_default+0x18/0x1a
> > [12844.112714] [<ffffffff802b0869>] ? generic_getxattr+0x4e/0x5c
>
> so you've got ext3. Nothing changed in the VFS or in ext3 in -git yet.
>
> the instruction pattern:
>
> Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b
> 45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b
> 00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff
> ========
>
> shows that you've got "prefetchnta (%esi)" indirect:
>
> 0f 18 00 prefetcht0 (%eax)
>
> so the prefetch instructions are patched in, neither the compiler nor
> the CPU should ignore them.

Well, I don't really know what that means ...

Besides, that's 64-bit code, but I guess that doesn't matter here.

Thanks,
Rafael
* Re: 2.6.25-git2: BUG: unable to handle kernel paging request at ffffffffffffffff
  2008-04-21 13:35 ` Rafael J. Wysocki
@ 2008-04-21 18:56 ` Ingo Molnar
  0 siblings, 0 replies; 183+ messages in thread
From: Ingo Molnar @ 2008-04-21 18:56 UTC
To: Rafael J. Wysocki; +Cc: LKML, Andrew Morton, Linus Torvalds, linux-ext4

* Rafael J. Wysocki <rjw@sisk.pl> wrote:

> On Monday, 21 of April 2008, Ingo Molnar wrote:
> >
> > * Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >
> > > Hi,
> > >
> > > I've just got the following traces from 2.6.25-git2 on HP nx6325
> > > (64-bit). I think they are related to the hang I described yesterday:
> >
> > > [12844.112673] [<ffffffff8029e236>] do_lookup+0x2c/0x1b2
> > > [12844.112683] [<ffffffff802a04b4>] __link_path_walk+0x8e6/0xdbd
> > > [12844.112707] [<ffffffffa004deb4>] ? :ext3:ext3_xattr_get_acl_default+0x18/0x1a
> > > [12844.112714] [<ffffffff802b0869>] ? generic_getxattr+0x4e/0x5c
> >
> > so you've got ext3. Nothing changed in the VFS or in ext3 in -git yet.
> >
> > the instruction pattern:
> >
> > Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b
> > 45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b
> > 00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff
> > ========
> >
> > shows that you've got "prefetchnta (%esi)" indirect:
> >
> > 0f 18 00 prefetcht0 (%eax)
> >
> > so the prefetch instructions are patched in, neither the compiler nor
> > the CPU should ignore them.
>
> Well, I don't really know what that means ...
>
> Besides, that's 64-bit code, but I guess that doesn't matter here.

correct, for 64-bit code that's prefetcht0 (%rax) - a non-destructive
'prefetch stuff from there into the cache' x86 instruction.

So real prefetches are done so i'd exclude any true SMP related barrier
race. (not that it's likely on x86 hardware anyway - memory barriers
usually only matter on Alpha and similar weakly-ordered architectures.)

	Ingo
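Editorial sketch of how the prefetch bytes decode. For the two-byte opcode 0f 18, the reg field of the following ModRM byte selects the hint and the rm field selects the base register. The function names below are invented for illustration. Applied to the 0f 18 08 sequence in the Code: line, the sketch yields prefetcht0 with base register 0 (rax in 64-bit code), which agrees with the 64-bit reading given above.

```c
#include <stddef.h>
#include <stdint.h>

/* Mnemonic selected by the ModRM byte that follows opcode 0f 18.
 * Returns NULL when the reg field is outside 0..3. */
const char *prefetch_hint(uint8_t modrm)
{
    static const char *const names[] = {
        "prefetchnta", "prefetcht0", "prefetcht1", "prefetcht2"
    };
    unsigned reg = (modrm >> 3) & 7;   /* bits 5..3 */
    return reg < 4 ? names[reg] : NULL;
}

/* Base register number from the rm field (bits 2..0); 0 is rax/eax. */
unsigned modrm_rm(uint8_t modrm)
{
    return modrm & 7;
}

/* 1 when the byte encodes a plain register-indirect operand:
 * mod == 0 and rm is neither the SIB escape (4) nor the disp32 escape (5). */
int is_plain_indirect(uint8_t modrm)
{
    unsigned mod = modrm >> 6;
    unsigned rm = modrm & 7;
    return mod == 0 && rm != 4 && rm != 5;
}
```

By the same table, a ModRM byte of 00 would select prefetchnta rather than prefetcht0, so only the 08 byte from the dump corresponds to the prefetcht0 reading.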