From: Stephen Brennan <stephen.s.brennan@oracle.com>
To: "Petr Tesařík" <petr@tesarici.cz>
Cc: Omar Sandoval <osandov@osandov.com>, linux-debuggers@vger.kernel.org
Subject: Re: Segmentation fault with drgn + libkdumpfile
Date: Wed, 10 Jan 2024 11:48:22 -0800
Message-ID: <87ply9ynp5.fsf@oracle.com>
In-Reply-To: <20240110190354.0ff56adc@meshulam.tesarici.cz>

Petr Tesařík <petr@tesarici.cz> writes:
> On Wed, 10 Jan 2024 14:49:28 +0100
> Petr Tesařík <petr@tesarici.cz> wrote:
>
>> On Wed, 10 Jan 2024 09:36:00 +0100
>> Petr Tesařík <petr@tesarici.cz> wrote:
>> 
>> > On Tue, 09 Jan 2024 17:40:15 -0800
>> > Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>> > 
>> > > Petr Tesařík <petr@tesarici.cz> writes:
>> > >   
>> > > > On Mon, 8 Jan 2024 21:40:08 +0100
>> > > > Petr Tesařík <petr@tesarici.cz> wrote:
>> > > >    
>> > > >> On Fri, 05 Jan 2024 13:53:15 -0800
>> > > >> Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>> > > >>     
>> > > >> > Petr Tesařík <petr@tesarici.cz> writes:      
>> > > >> > > On Fri, 05 Jan 2024 10:38:16 -0800
>> > > >> > > Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>> > > >> > >        
>> > > >> > >> Hi Petr,
>> > > >> > >> 
>> > > >> > >> I recently encountered a segmentation fault with libkdumpfile & drgn
>> > > >> > >> which appears to be related to the cache implementation. I've included
>> > > >> > >> the stack trace at the end of this message, since it's a bit of a longer
>> > > >> > >> one. The exact issue occurred with a test vmcore that I could probably
>> > > >> > >> share with you privately if you'd like. In any case, the reproducer is
>> > > >> > >> fairly straightforward in drgn code:
>> > > >> > >> 
>> > > >> > >> from drgn.helpers.linux import for_each_task
>> > > >> > >> 
>> > > >> > >> for t in for_each_task(prog):
>> > > >> > >>     prog.stack_trace(t)
>> > > >> > >> for t in for_each_task(prog):
>> > > >> > >>     prog.stack_trace(t)
>> > > >> > >> 
>> > > >> > >> The repetition is required: the segfault only occurs on the second
>> > > >> > >> iteration of the loop, which, in hindsight, is a textbook sign that
>> > > >> > >> the issue has to do with caching. I'd expect the issue is specific
>> > > >> > >> to this vmcore; it doesn't reproduce on others.
>> > > >> > >> 
>> > > >> > >> I stuck that into a git bisect script and bisected the libkdumpfile
>> > > >> > >> commit that introduced it:
>> > > >> > >> 
>> > > >> > >> commit 487a8042ea5da580e1fdb5b8f91c8bd7cad05cd6
>> > > >> > >> Author: Petr Tesarik <petr@tesarici.cz>
>> > > >> > >> Date:   Wed Jan 11 22:53:01 2023 +0100
>> > > >> > >> 
>> > > >> > >>     Cache: Calculate eprobe in reinit_entry()
>> > > >> > >> 
>> > > >> > >>     If this function is called to reuse a ghost entry, the probe list
>> > > >> > >>     has not been walked yet, so eprobe is left uninitialized.
>> > > >> > >> 
>> > > >> > >>     This passed the test case, because the correct old value was left
>> > > >> > >>     on stack. Modify the test case to poison the stack.
>> > > >> > >> 
>> > > >> > >>     Signed-off-by: Petr Tesarik <petr@tesarici.cz>
>> > > >> > >> 
>> > > >> > >>  src/kdumpfile/cache.c      |  6 +++++-
>> > > >> > >>  src/kdumpfile/test-cache.c | 13 +++++++++++++
>> > > >> > >>  2 files changed, 18 insertions(+), 1 deletion(-)        
>> > > >> > >
>> > > >> > > This looks like a red herring to me. The cache most likely continues in
>> > > >> > > a corrupted state without this commit, which may mask the issue (until
>> > > >> > > it resurfaces later).        
>> > > >> > 
>> > > >> > I see, that makes a lot of sense.
>> > > >> >       
>> > > >> > >> I haven't yet tried to debug the logic of the cache implementation and
>> > > >> > >> create a patch. I'm totally willing to try that, but I figured I would
>> > > >> > >> send this report to you first, to see if there's something obvious that
>> > > >> > >> sticks out to your eyes.        
>> > > >> > >
>> > > >> > > No, but I should be able to recreate the issue if I get a log of the
>> > > >> > > cache API calls:
>> > > >> > >
>> > > >> > > - cache_alloc() - to know the number of elements
>> > > >> > > - cache_get_entry()
>> > > >> > > - cache_put_entry()
>> > > >> > > - cache_insert()
>> > > >> > > - cache_discard()
>> > > >> > > - cache_flush() - not likely after initialization, but...        
>> > > >> > 
>> > > >> > I went ahead and logged each of these calls as you suggested. I tried
>> > > >> > to log them at the beginning of each function and always include the
>> > > >> > cache pointer, the cache_entry, and the key. I took the resulting log,
>> > > >> > filtered it down to the most recently logged cache prior to the crash,
>> > > >> > compressed it, and attached it. For completeness, the patch I used is
>> > > >> > below (applies to tip branch 8254897 ("Merge pull request #78 from
>> > > >> > fweimer-rh/c99")).
>> > > >> > 
>> > > >> > I'll also see if I can reproduce it based on the log.      
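>> > > >> > 
>> > > >> > (Roughly, each hook looked like this -- a sketch of the shape, not
>> > > >> > the exact patch:
>> > > >> > 
>> > > >> > 	fprintf(stderr, "%p: cache_get_entry(%llx)\n",
>> > > >> > 		(void *) cache, (unsigned long long) key);
>> > > >> > 
>> > > >> > with the cache pointer always first, so the output can be filtered
>> > > >> > per cache.)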
>> > > >> 
>> > > >> Thank you for the log. I haven't had much time to look at it, but the
>> > > >> first line is a good hint already:
>> > > >> 
>> > > >> 0x56098b68c4c0: cache_alloc(1024, 0)
>> > > >> 
>> > > >> Zero size means the data pointers are managed by the caller, so this
>> > > >> must be the cache of mmap()'ed segments. That's the only cache which
>> > > >> installs a cleanup callback with set_cache_entry_cleanup(). There is
>> > > >> only one call to the cleanup callback for evicted entries in cache.c:
>> > > >> 
>> > > >> 		/* Get an unused cached entry. */
>> > > >> 		if (cs->nuprobe != 0 &&
>> > > >> 		    (cs->nuprec == 0 || cache->nprobe + bias > cache->dprobe))
>> > > >> 			evict = evict_probe(cache, cs);
>> > > >> 		else
>> > > >> 			evict = evict_prec(cache, cs);
>> > > >> 		if (cache->entry_cleanup)
>> > > >> 			cache->entry_cleanup(cache->cleanup_data, evict);
>> > > >> 
>> > > >> The entries can be evicted from the probe partition or from the precious
>> > > >> partition. This might be relevant. Please, can you re-run and log where
>> > > >> the evict entry comes from?    
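>> > > >> 
>> > > >> (An fprintf(stderr, ...) in each branch of the if/else above would
>> > > >> be enough to tell probe evictions from precious evictions.)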
>> > > >
>> > > > I found some time this morning, and that extra logging wouldn't help.
>> > > > Because of a bug in fcache_new(), the number of elements in the cache
>> > > > is big enough that cache entries are never evicted in your case. It's
>> > > > quite weird to hit a cache metadata bug after elements have been
>> > > > inserted. FWIW, I am not able to reproduce the bug by replaying the
>> > > > logged file read pattern.
>> > > >
>> > > > Since you have a reliable reproducer, it cannot be a Heisenbug. But it
>> > > > could be caused by the other cache - the cache of decompressed pages.
>> > > > Do you know for sure that lzo1x_decompress_safe() crashes while trying
>> > > > to _read_ from the input buffer, and not while trying to _write_ to the
>> > > > output buffer?    
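>> > > >
>> > > > (One way to tell, assuming your gdb session has $_siginfo: print
>> > > > the faulting address with "p $_siginfo._sifields._sigfault.si_addr"
>> > > > and then "x/i $pc" to see whether the faulting instruction loads
>> > > > from or stores to it.)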
>> > > 
>> > > Hi Petr,
>> > > 
>> > > Sorry for the delay here; I got pulled into other issues and am trying
>> > > to attend to all my work in a round-robin fashion :)
>> > 
>> > Hi Stephen,
>> > 
>> > That's fine; I also work on this only as time permits. ;-)
>> > 
>> > > The fault is definitely in lzo1x_decompress_safe() *writing* to address
>> > > 0. I fetched debuginfo for all the necessary libraries and we see the
>> > > following stack trace:
>> > > 
>> > > %<-----------------------
>> > > #0  0x00007fcd9adddef3 in lzo1x_decompress_safe (in=<optimized out>,
>> > >     in_len=<optimized out>, out=0x0, out_len=0x7ffdee2c1388, wrkmem=<optimized out>)
>> > >     at src/lzo1x_d.ch:120
>> > > #1  0x00007fcd9ae25be1 in diskdump_read_page (pio=0x7ffdee2c1590) at diskdump.c:584
>> > > #2  0x00007fcd9ae32d4d in _kdumpfile_priv_cache_get_page (pio=0x7ffdee2c1590,
>> > >     fn=0x7fcd9ae257ae <diskdump_read_page>) at read.c:69
>> > > #3  0x00007fcd9ae25e44 in diskdump_get_page (pio=0x7ffdee2c1590) at diskdump.c:647
>> > > #4  0x00007fcd9ae32be0 in get_page (pio=0x7ffdee2c1590)
>> > >     at /home/stepbren/repos/libkdumpfile/src/kdumpfile/kdumpfile-priv.h:1512
>> > > #5  0x00007fcd9ae32ed4 in get_page_xlat (pio=0x7ffdee2c1590) at read.c:126
>> > > #6  0x00007fcd9ae32f22 in get_page_maybe_xlat (pio=0x7ffdee2c1590) at read.c:137
>> > > #7  0x00007fcd9ae32fb1 in _kdumpfile_priv_read_locked (ctx=0x55745bfca8f0,
>> > >     as=KDUMP_KVADDR, addr=18446612133360081960, buffer=0x7ffdee2c17df,
>> > >     plength=0x7ffdee2c1698) at read.c:169
>> > > #8  0x00007fcd9ae330dd in kdump_read (ctx=0x55745bfca8f0, as=KDUMP_KVADDR,
>> > >     addr=18446612133360081960, buffer=0x7ffdee2c17df, plength=0x7ffdee2c1698)
>> > >     at read.c:196
>> > > #9  0x00007fcd9afb0cc4 in drgn_read_kdump (buf=0x7ffdee2c17df,
>> > >     address=18446612133360081960, count=4, offset=18446612133360081960,
>> > >     arg=0x55745bfca8f0, physical=false) at ../../libdrgn/kdump.c:73
>> > > %<-----------------------
>> > > 
>> > > In frame 1 where we are calling the decompressor:
>> > > 
>> > > %<-----------------------
>> > > (gdb) frame 1
>> > > #1  0x00007fcd9ae25be1 in diskdump_read_page (pio=0x7ffdee2c1590) at diskdump.c:584
>> > > 584                     int ret = lzo1x_decompress_safe(fch.data, pd.size,
>> > > (gdb) list
>> > > 579                     if (ret != KDUMP_OK)
>> > > 580                             return ret;
>> > > 581             } else if (pd.flags & DUMP_DH_COMPRESSED_LZO) {
>> > > 582     #if USE_LZO
>> > > 583                     lzo_uint retlen = get_page_size(ctx);
>> > > 584                     int ret = lzo1x_decompress_safe(fch.data, pd.size,
>> > > 585                                                     pio->chunk.data,
>> > > 586                                                     &retlen,
>> > > 587                                                     LZO1X_MEM_DECOMPRESS);
>> > > 588                     fcache_put_chunk(&fch);
>> > > (gdb) p retlen
>> > > $7 = 0  
>> > 
>> > This is a bit weird. Looking at liblzo sources, it seems to me that
>> > the output length is not changed until right before returning from
>> > lzo1x_decompress_safe().
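>> > 
>> > (For reference, the prototype -- with the parameter names your gdb
>> > output shows -- is:
>> > 
>> > 	int lzo1x_decompress_safe(const lzo_bytep in, lzo_uint in_len,
>> > 				  lzo_bytep out, lzo_uintp out_len,
>> > 				  lzo_voidp wrkmem);
>> > 
>> > out is the destination buffer, and *out_len is only stored on the
>> > way out.)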
>> > 
>> > > (gdb) p pio->chunk.data
>> > > $8 = (void *) 0x0  
>> > 
>> > OK, here's our immediate root cause. ;-)
>> > 
>> > > (gdb) p fch.data
>> > > $9 = (void *) 0x7fcd7cc33da4  
>> > 
>> > This looks sane.
>> > 
>> > > (gdb) p pd.size
>> > > $10 = 816  
>> > 
>> > This also looks sane.
>> > 
>> > > %<-----------------------
>> > > 
>> > > As far as I can tell, pio->chunk.data comes directly from the
>> > > cache_get_page() function in frame 2:
>> > > 
>> > > %<-----------------------
>> > > (gdb) up
>> > > #2  0x00007fcd9ae32d4d in _kdumpfile_priv_cache_get_page (pio=0x7ffdee2c1590,
>> > >     fn=0x7fcd9ae257ae <diskdump_read_page>) at read.c:69
>> > > 69              ret = fn(pio);
>> > > (gdb) list
>> > > 64              pio->chunk.data = entry->data;
>> > > 65              pio->chunk.embed_fces->ce = entry;
>> > > 66              if (cache_entry_valid(entry))
>> > > 67                      return KDUMP_OK;
>> > > 68
>> > > 69              ret = fn(pio);
>> > > 70              mutex_lock(&ctx->shared->cache_lock);
>> > > 71              if (ret == KDUMP_OK)
>> > > 72                      cache_insert(pio->chunk.embed_fces->cache, entry);
>> > > 73              else
>> > > (gdb) p *entry
>> > > $11 = {key = 1045860353, state = cs_precious, next = 626, prev = 626, refcnt = 1,
>> > >   data = 0x0}  
>> > 
>> > The key (0x3e569000 | ADDRXLAT_MACHPHYSADDR) corresponds to the
>> > requested virtual address 0xffff88003e569c28.
>> > 
>> > > (gdb) p *pio
>> > > $12 = {ctx = 0x55745bfca8f0, addr = {addr = 1045860352, as = ADDRXLAT_MACHPHYSADDR},
>> > >   chunk = {data = 0x0, nent = 1, {embed_fces = {{data = 0xffff880ff1470788,
>> > >           len = 140728599320032, ce = 0x55745c1003d8, cache = 0x55745c0fb540}, {
>> > >           data = 0x55745bfd42f0, len = 140728599320112,
>> > >           ce = 0x7fcd9ae330ef <kdump_read+102>, cache = 0xffff88003e569c28}},
>> > >       fces = 0xffff880ff1470788}}}
>> > > %<-----------------------  
>> > 
>> > 
>> > Looking at pio->chunk.embed_fces->ce, struct cache_entry is at
>> > 0x55745c1003d8. Assuming that sizeof(struct cache_entry) == 32 on your
>> > system, this is element 626 in the cache entry array. The next and
>> > prev indices are also 626, which looks good, because cache->ninflight
>> > is 1, so this is the only element in the (circular) in-flight list.
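>> > 
>> > (For reference, a field layout consistent with your gdb output --
>> > the exact types are my reconstruction, not a quote from the source:
>> > 
>> > 	struct cache_entry {
>> > 		uint64_t key;		/*  8 bytes: cache key */
>> > 		int state;		/*  4 bytes: cs_* state */
>> > 		unsigned next, prev;	/*  8 bytes: list indices */
>> > 		unsigned refcnt;	/*  4 bytes: reference count */
>> > 		void *data;		/*  8 bytes: data pointer */
>> > 	};				/* 32 bytes total, no padding */
>> > 
>> > which is where the sizeof == 32 assumption comes from.)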
>> > 
>> > Since state is cs_precious, but the data was discarded, this cache
>> > entry has just been recovered from a ghost partition, evicting another
>> > entry, and _that_ entry had a NULL data pointer.
>> > 
>> > It would be really helpful if I could get the log for this cache
>> > instead of the one you posted earlier.
>> 
>> Let's recap:
>> 
>> 1. Data indicates that a reused ghost entry has a NULL data pointer.
>> 2. Reverting commit 487a8042ea5da580e1fdb5b8f91c8bd7cad05cd6 masks the issue.
>> 
>> My conclusion is that the data pointer was taken from an entry in the
>> unused partition. This partition is empty except when the cache is new
>> or after calling cache_discard(). Given the statistic counter values,
>> the latter is the case here.
>> 
>> I think I found the bug: reinit_entry() finds the unused partition by
>> skipping over the ghost probe partition, but if the target itself is a
>> reused ghost probe entry, cache->ngprobe was already decremented in
>> get_ghost_or_missed_entry().
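>> 
>> Schematically (paraphrased with the names above, not the actual
>> source):
>> 
>> 	/* get_ghost_or_missed_entry(): reuse a ghost probe entry */
>> 	--cache->ngprobe;	/* entry leaves the ghost probe list */
>> 	reinit_entry(cache, entry);
>> 
>> 	/* reinit_entry(): skip cache->ngprobe entries to find the
>> 	 * unused partition -- but the count is already one short,
>> 	 * so it can land on a ghost entry, whose data pointer is
>> 	 * still NULL.
>> 	 */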
>> 
>> I'm going to write a test case and fix.
>
> I was able to write a test case for the NULL data pointer. The bug
> should be fixed by commit e63dab40c8cefdfc013dc57915de9852dbb283e4.
>
> Stephen, can you rebuild and verify with your dump file that this was
> indeed the same bug?

Hi Petr,

I went ahead and pulled the latest from tip and ran my reproducer. The
segmentation fault is indeed gone!

To be exact, I tested c10036e ("cache: Optimize data reclaim for missed
entries"), which contains e63dab4 as well.

Thank you so much for taking the time to go back & forth with me on this
one! I suppose there's a reason that caching is one of the two hard
problems in computer science ;)

I owe you a beverage of your choice next time we see each other!

Thanks,
Stephen

Thread overview: 11+ messages
2024-01-05 18:38 Segmentation fault with drgn + libkdumpfile Stephen Brennan
2024-01-05 19:23 ` Petr Tesařík
2024-01-05 21:53   ` Stephen Brennan
2024-01-08 20:40     ` Petr Tesařík
2024-01-09  9:06       ` Petr Tesařík
2024-01-10  1:40         ` Stephen Brennan
2024-01-10  8:36           ` Petr Tesařík
2024-01-10 13:49             ` Petr Tesařík
2024-01-10 18:03               ` Petr Tesařík
2024-01-10 19:48                 ` Stephen Brennan [this message]
2024-01-10 19:58                   ` Petr Tesařík
