* Re: BRSGP relocation truncations in linking kernel for Alpha.
2016-10-25 8:26 ` Michael Cree
@ 2016-10-25 16:58 ` Richard Henderson
2016-10-25 18:07 ` Richard Henderson
2017-02-09 9:58 ` Alpha Kernel Regression [was Re: BRSGP relocation truncations in linking kernel for Alpha.] Michael Cree
2 siblings, 0 replies; 10+ messages in thread
From: Richard Henderson @ 2016-10-25 16:58 UTC
To: Michael Cree, Helge Deller, linux-alpha
On 10/25/2016 01:26 AM, Michael Cree wrote:
> On Mon, Oct 24, 2016 at 08:13:22PM -0700, Richard Henderson wrote:
>> On 10/24/2016 01:59 PM, Helge Deller wrote:
>>> Does it only happen for __copy_user()?
>>> If yes, I assume the problem happens because __copy_tofrom_user_nocheck() in
>>> uaccess.h is always inlined. Maybe un-inlining helps?
>>> Or does the inline assembly for jsr/bsr need tweaking?
>>
>> Indeed. Try changing #ifdef MODULE above __copy_tofrom_user_nocheck to #if 1.
>
> Thanks, that indeed fixes it.
>
> While I have your ears (or eyes since this is email) I have been
> running the ltp tests and noted some syscall test failures. The sendmsg
> call fails with a segfault instead of returning EFAULT, when one passes
> in a NULL for the message arg. When the test is run under gdb it
> reports the segfault in the libc sendmsg function at the instruction
> immediately following the syscall instruction and that instruction is a
> move of a register to another register, not a memory access. Would I
> be correct in surmising that the segfault actually occurred in kernel
> code but gets reported at the instruction following the syscall
> instruction in user space?
That would be my assumption.
> (But looking at the kernel source for
> sys_sendmsg it looks to me that the msg argument is properly checked
> before being accessed so why a segfault should occur is not
> obvious...)
Indeed, I don't see anything wrong with copy_msghdr_from_user.
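One way to separate kernel behaviour from libc behaviour here is to issue the system call directly, bypassing the libc sendmsg() wrapper. The following is only a sketch: the helper name raw_sendmsg_null_errno is made up for illustration, and it assumes a Linux system where SYS_sendmsg is defined in <sys/syscall.h>.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a raw sendmsg syscall with a NULL msghdr.
   Returns the resulting errno, 0 if the call unexpectedly succeeded,
   or -1 if the test socketpair could not be created. */
int raw_sendmsg_null_errno(void)
{
    int sv[2];
    long ret;
    int err;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    errno = 0;
    /* No libc wrapper is involved, so the NULL pointer is only ever
       touched by the kernel's copy of the message header. */
    ret = syscall(SYS_sendmsg, sv[0], (void *)NULL, 0L);
    err = (ret == -1) ? errno : 0;  /* save errno before close() can change it */

    close(sv[0]);
    close(sv[1]);
    return err;
}
```

If this returns EFAULT while the libc sendmsg() call segfaults for the same argument, that would suggest the fault happens in user space rather than in the kernel, though it does not prove it.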
>
> (I see fstatfs also segfaults when passed a bad pointer but that's
> because libc does not pass the bad address to the kernel, but
> passes the address of a temporary buffer from which libc then
> copies back into the bad address, thus kaboom instead of EFAULT.
> C'est la vie.)
Yep.
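The pattern described above can be sketched roughly as follows. This is not the glibc source; wrapped_fstatfs is a hypothetical stand-in for a wrapper that hands the kernel a temporary buffer and copies the result out afterwards.

```c
#include <sys/vfs.h>

/* Hypothetical wrapper, for illustration only. The kernel only ever
   sees &tmp, which is always a valid address, so it has no reason to
   return EFAULT for a bad user_buf. */
int wrapped_fstatfs(int fd, struct statfs *user_buf)
{
    struct statfs tmp;
    int ret = fstatfs(fd, &tmp);

    if (ret == 0)
        *user_buf = tmp;  /* plain user-space store: SIGSEGV if user_buf is bad */

    return ret;
}
```

With a bad user_buf the store on the success path faults in user space, which is why the test sees a segfault rather than EFAULT.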
>
> And while I mention gdb, it no longer works on Alpha since version
> 7.10. Richard, would you be able to take a look at the bug report:
> https://sourceware.org/bugzilla/show_bug.cgi?id=19061
Ah, yes. I'll put that on my to-do list.
> And while I mention libc I am seeing (rather rare) random segfaults
> in programs such as cp, tar, install and dpkg ever since the upgrade
> to glibc 2.23 (or maybe it was 2.24). I am struggling to get a
> backtrace because it only happens very occasionally (but often enough
> that it is almost impossible for a build and install of large software
> packages such as libreoffice to complete without failure at some
> random point) and when I rerun the failing program manually it then
> always works. I'll keep trying to narrow this one down.
Hmm. I don't recall having seen such a thing. But then, I can't recall the
last time I did anything more than test glibc in situ, which doesn't show these
sorts of things. Perhaps I should do a test install to a sysroot or something...
r~
* Re: BRSGP relocation truncations in linking kernel for Alpha.
2016-10-25 8:26 ` Michael Cree
2016-10-25 16:58 ` Richard Henderson
@ 2016-10-25 18:07 ` Richard Henderson
2016-10-26 6:56 ` Michael Cree
2017-02-09 9:58 ` Alpha Kernel Regression [was Re: BRSGP relocation truncations in linking kernel for Alpha.] Michael Cree
2 siblings, 1 reply; 10+ messages in thread
From: Richard Henderson @ 2016-10-25 18:07 UTC
To: Michael Cree, Helge Deller, linux-alpha
On 10/25/2016 01:26 AM, Michael Cree wrote:
> And while I mention gdb, it no longer works on Alpha since version
> 7.10. Richard, would you be able to take a look at the bug report:
> https://sourceware.org/bugzilla/show_bug.cgi?id=19061
In the PR, Pedro has exactly the right pointer to the problem.
From arch/alpha/kernel/traps.c:
    info.si_signo = SIGTRAP;
    info.si_errno = 0;
    info.si_code = TRAP_BRKPT;
    info.si_trapno = 0;
    info.si_addr = (void __user *) regs->pc;

    if (ptrace_cancel_bpt(current)) {
        regs->pc -= 4; /* make pc point to former bpt */
    }
So we report the same si_code for executing a breakpoint insn inserted by gdb,
and a "hardware" breakpoint managed by the kernel. But for the latter, we
already back up the PC.
So gdb winds up backing up the PC twice.
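The double adjustment can be modelled with a small standalone sketch. None of this is kernel or gdb code: the function names are invented, and it assumes the trap leaves the PC one 4-byte instruction past the breakpoint, as the "make pc point to former bpt" comment implies.

```c
#include <stdbool.h>
#include <stdint.h>

#define INSN_SIZE 4  /* Alpha instructions are 4 bytes wide */

/* PC the kernel reports after a breakpoint trap at bpt_addr.
   Assumption: the trap leaves pc just past the breakpoint insn. */
uint64_t kernel_reported_pc(uint64_t bpt_addr, bool kernel_managed_bpt)
{
    uint64_t pc = bpt_addr + INSN_SIZE;

    if (kernel_managed_bpt)
        pc -= INSN_SIZE;  /* mirrors "regs->pc -= 4" in the excerpt */

    return pc;
}

/* A debugger that cannot tell the two cases apart backs up every
   TRAP_BRKPT stop by one instruction. */
uint64_t debugger_adjusted_pc(uint64_t reported_pc)
{
    return reported_pc - INSN_SIZE;
}
```

For a breakpoint at 0x1000, the gdb-inserted case ends up at 0x1000 as intended, while the kernel-managed case ends up at 0xffc, one instruction too far back.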
This ought to be fixed by using TRAP_HWBKPT (4) for the ptrace_cancel_bpt case,
but telling gdb about the issue in gdb/nat/linux-ptrace.c like so:
    #elif defined __alpha__
    # define GDB_ARCH_IS_TRAP_BRKPT(X) ((X) == TRAP_BRKPT)
    # define GDB_ARCH_IS_TRAP_HWBKPT(X) ((X) == TRAP_BRKPT || (X) == TRAP_HWBKPT)
which looks confusing, but does get checked:
    if (GDB_ARCH_IS_TRAP_BRKPT (siginfo.si_code)
        && GDB_ARCH_IS_TRAP_HWBKPT (siginfo.si_code))
      {
        /* The si_code is ambiguous on this arch -- check debug
           registers. */
        if (!check_stopped_by_watchpoint (lp))
          lp->stop_reason = TARGET_STOPPED_BY_SW_BREAKPOINT;
but at the moment the default definition of GDB_ARCH_IS_TRAP_HWBKPT is always
false for alpha.
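To see why the overlapping definitions are not as odd as they look, here is a standalone sketch of how the two proposed macros classify each si_code. The macros are copied from above; si_code_is_ambiguous is a made-up helper, and the numeric values are defined locally (TRAP_BRKPT as 1, TRAP_HWBKPT as 4) rather than taken from the system headers.

```c
#include <stdbool.h>

/* Defined locally so the sketch does not depend on <signal.h>. */
#define TRAP_BRKPT  1
#define TRAP_HWBKPT 4

/* The proposed alpha definitions. */
#define GDB_ARCH_IS_TRAP_BRKPT(X)  ((X) == TRAP_BRKPT)
#define GDB_ARCH_IS_TRAP_HWBKPT(X) ((X) == TRAP_BRKPT || (X) == TRAP_HWBKPT)

/* Hypothetical helper: true when both macros match, i.e. when the
   quoted gdb code would take the "ambiguous" branch. */
bool si_code_is_ambiguous(int si_code)
{
    return GDB_ARCH_IS_TRAP_BRKPT(si_code)
        && GDB_ARCH_IS_TRAP_HWBKPT(si_code);
}
```

With these definitions TRAP_BRKPT matches both macros and lands in the ambiguous branch, while TRAP_HWBKPT matches only the hardware macro.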
Another fix would be to completely disable gdb's use of "hardware" breakpoints
for alpha. Are they really more efficient than letting gdb manage everything?
r~
* Re: BRSGP relocation truncations in linking kernel for Alpha.
2016-10-25 18:07 ` Richard Henderson
@ 2016-10-26 6:56 ` Michael Cree
2016-10-26 15:18 ` Richard Henderson
0 siblings, 1 reply; 10+ messages in thread
From: Michael Cree @ 2016-10-26 6:56 UTC
To: Richard Henderson; +Cc: Helge Deller, linux-alpha
On Tue, Oct 25, 2016 at 11:07:38AM -0700, Richard Henderson wrote:
> On 10/25/2016 01:26 AM, Michael Cree wrote:
> > And while I mention gdb, it no longer works on Alpha since version
> > 7.10. Richard, would you be able to take a look at the bug report:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=19061
>
> In the PR, Pedro has exactly the right pointer to the problem.
>
> From arch/alpha/kernel/traps.c:
>
> info.si_signo = SIGTRAP;
> info.si_errno = 0;
> info.si_code = TRAP_BRKPT;
> info.si_trapno = 0;
> info.si_addr = (void __user *) regs->pc;
>
> if (ptrace_cancel_bpt(current)) {
> regs->pc -= 4; /* make pc point to former bpt */
> }
>
> So we report the same si_code for executing a breakpoint insn inserted by gdb,
> and a "hardware" breakpoint managed by the kernel. But for the latter, we
> already back up the PC.
>
> So gdb winds up backing up the PC twice.
>
> This ought to be fixed by using TRAP_HWBKPT (4) for the ptrace_cancel_bpt case,
> but telling gdb about the issue in gdb/nat/linux-ptrace.c like so:
>
> #elif defined __alpha__
> # define GDB_ARCH_IS_TRAP_BRKPT(X) ((X) == TRAP_BRKPT)
> # define GDB_ARCH_IS_TRAP_HWBKPT(X) ((X) == TRAP_BRKPT || (X) == TRAP_HWBKPT)
>
> which looks confusing, but does get checked:
>
> if (GDB_ARCH_IS_TRAP_BRKPT (siginfo.si_code)
> && GDB_ARCH_IS_TRAP_HWBKPT (siginfo.si_code))
> {
> /* The si_code is ambiguous on this arch -- check debug
> registers. */
> if (!check_stopped_by_watchpoint (lp))
> lp->stop_reason = TARGET_STOPPED_BY_SW_BREAKPOINT;
>
> but at the moment the default definition of GDB_ARCH_IS_TRAP_HWBKPT is always
> false for alpha.
By saying "This ought to be fixed by [...] but at the moment [...]"
are you saying that the fix provided above will not work? Indeed,
it doesn't work: I tried it and just about every test in the gdb test
suite still fails. For example:
Running /home/mjc/toolchain/gdb-build/gdb/testsuite/../../../binutils-gdb/gdb/testsuite/gdb.base/break.exp ...
FAIL: gdb.base/break.exp: run until function breakpoint (timeout)
FAIL: gdb.base/break.exp: list marker1 (timeout)
FAIL: gdb.base/break.exp: break lineno (timeout)
FAIL: gdb.base/break.exp: delete $bpnum (timeout)
FAIL: gdb.base/break.exp: run until breakpoint set at a line number (timeout)
FAIL: gdb.base/break.exp: run until file:function(6) breakpoint (timeout)
Cheers
Michael.
* Re: BRSGP relocation truncations in linking kernel for Alpha.
2016-10-26 6:56 ` Michael Cree
@ 2016-10-26 15:18 ` Richard Henderson
0 siblings, 0 replies; 10+ messages in thread
From: Richard Henderson @ 2016-10-26 15:18 UTC
To: Michael Cree, Helge Deller, linux-alpha
On 10/25/2016 11:56 PM, Michael Cree wrote:
> On Tue, Oct 25, 2016 at 11:07:38AM -0700, Richard Henderson wrote:
>> On 10/25/2016 01:26 AM, Michael Cree wrote:
>>> And while I mention gdb, it no longer works on Alpha since version
>>> 7.10. Richard, would you be able to take a look at the bug report:
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=19061
>>
>> In the PR, Pedro has exactly the right pointer to the problem.
>>
>> From arch/alpha/kernel/traps.c:
>>
>> info.si_signo = SIGTRAP;
>> info.si_errno = 0;
>> info.si_code = TRAP_BRKPT;
>> info.si_trapno = 0;
>> info.si_addr = (void __user *) regs->pc;
>>
>> if (ptrace_cancel_bpt(current)) {
>> regs->pc -= 4; /* make pc point to former bpt */
>> }
>>
>> So we report the same si_code for executing a breakpoint insn inserted by gdb,
>> and a "hardware" breakpoint managed by the kernel. But for the latter, we
>> already back up the PC.
>>
>> So gdb winds up backing up the PC twice.
>>
>> This ought to be fixed by using TRAP_HWBKPT (4) for the ptrace_cancel_bpt case,
>> but telling gdb about the issue in gdb/nat/linux-ptrace.c like so:
>>
>> #elif defined __alpha__
>> # define GDB_ARCH_IS_TRAP_BRKPT(X) ((X) == TRAP_BRKPT)
>> # define GDB_ARCH_IS_TRAP_HWBKPT(X) ((X) == TRAP_BRKPT || (X) == TRAP_HWBKPT)
>>
>> which looks confusing, but does get checked:
>>
>> if (GDB_ARCH_IS_TRAP_BRKPT (siginfo.si_code)
>> && GDB_ARCH_IS_TRAP_HWBKPT (siginfo.si_code))
>> {
>> /* The si_code is ambiguous on this arch -- check debug
>> registers. */
>> if (!check_stopped_by_watchpoint (lp))
>> lp->stop_reason = TARGET_STOPPED_BY_SW_BREAKPOINT;
>>
>> but at the moment the default definition of GDB_ARCH_IS_TRAP_HWBKPT is always
>> false for alpha.
>
> By saying "This ought to be fixed by [...] but at the moment [...]"
> are you saying that the fix provided above will not work?
What I meant is "this is what I'm going to try". And then forgot about the
(annoyingly long) build I'd left running yesterday.
> Indeed,
> it doesn't work: I tried it and just about every test in the gdb test
> suite still fails. For example:
>
> Running /home/mjc/toolchain/gdb-build/gdb/testsuite/../../../binutils-gdb/gdb/testsuite/gdb.base/break.exp ...
> FAIL: gdb.base/break.exp: run until function breakpoint (timeout)
> FAIL: gdb.base/break.exp: list marker1 (timeout)
> FAIL: gdb.base/break.exp: break lineno (timeout)
> FAIL: gdb.base/break.exp: delete $bpnum (timeout)
> FAIL: gdb.base/break.exp: run until breakpoint set at a line number (timeout)
> FAIL: gdb.base/break.exp: run until file:function(6) breakpoint (timeout)
Ok. Well, I hope to get back to it soon.
r~
* Alpha Kernel Regression [was Re: BRSGP relocation truncations in linking kernel for Alpha.]
2016-10-25 8:26 ` Michael Cree
2016-10-25 16:58 ` Richard Henderson
2016-10-25 18:07 ` Richard Henderson
@ 2017-02-09 9:58 ` Michael Cree
2 siblings, 0 replies; 10+ messages in thread
From: Michael Cree @ 2017-02-09 9:58 UTC
To: Richard Henderson, linux-alpha, Chen Feng
Cc: Helge Deller, LKML, John Paul Adrian Glaubitz, Michael Karcher
On Tue, Oct 25, 2016 at 09:26:38PM +1300, Michael Cree wrote:
> And while I mention libc I am seeing (rather rare) random segfaults
> in programs such as cp, tar, install and dpkg ever since the upgrade
> to glibc 2.23 (or maybe it was 2.24). I am struggling to get a
> backtrace because it only happens very occasionally (but often enough
> that it is almost impossible for a build and install of large software
> packages such as libreoffice to complete without failure at some
> random point) and when I rerun the failing program manually it then
> always works. I'll keep trying to narrow this one down.
It's not glibc. Downgrading to previously known working versions does
not solve the random segfaults. But downgrading the kernel does fix
the problem on Alpha. I noted that 4.6 is good but 4.6.7 is bad, so I
bisected the 4.6.y stable kernel branch and found the first bad commit
to be 0784672d05684de901fc2aa56150d7ea9a475a2d, i.e.:
commit 0784672d05684de901fc2aa56150d7ea9a475a2d
Author: Chen Feng <puck.chen@hisilicon.com>
Date:   Fri May 20 16:59:02 2016 -0700

    mm/compaction.c: fix zoneindex in kcompactd()

    commit 6cd9dc3e75078ef646076fa63adfb9b85ced0b66 upstream.

    While testing the kcompactd in my platform 3G MEM only DMA ZONE. I
    found the kcompactd never wakeup. It seems the zoneindex has already
    minus 1 before. So the traverse here should be <=.

    It fixes a regression where kswapd could previously compact, but
    kcompactd not. Not a crash fix though.
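The off-by-one that the commit message describes can be illustrated with a standalone sketch. This is not the kernel's loop: zones_visited and its parameters are invented for illustration, and the sketch assumes only what the message states, namely that the index passed in is already the highest valid zone index rather than a zone count.

```c
/* Hypothetical model of a scan over zone indices 0..classzone_idx.
   With inclusive == 0 the loop uses "<" (the old bound); with
   inclusive != 0 it uses "<=" (the bound the commit switches to). */
int zones_visited(int classzone_idx, int inclusive)
{
    int visited = 0;
    int zoneid;

    if (inclusive) {
        for (zoneid = 0; zoneid <= classzone_idx; zoneid++)
            visited++;
    } else {
        for (zoneid = 0; zoneid < classzone_idx; zoneid++)
            visited++;
    }

    return visited;
}
```

On a machine with a single zone (index 0), the "<" bound visits no zones at all, which matches the report that kcompactd never woke up, while "<=" visits the one zone.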
Cheers
Michael.