* [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Jan Stancek @ 2019-12-03 12:50 UTC
  To: linux-xfs, linux-fsdevel, hch, darrick.wong
  Cc: linuxppc-dev, Memory Management, LTP Mailing List,
      Linux Stable maillist, CKI Project, Michael Ellerman

Hi,

(This bug report is a summary of thread [1] with some additions)

User-space binaries on Power9 ppc64le (with 64k pages) on an xfs
filesystem are sporadically hitting SIGBUS:

---------- 8< ----------
(gdb) r
Starting program: /mnt/testarea/ltp/testcases/bin/genasin

Program received signal SIGBUS, Bus error.
dl_main (phdr=0x10000040, phnum=<optimized out>, user_entry=0x7fffffffe760, auxv=<optimized out>) at rtld.c:1362
1362        switch (ph->p_type)

(gdb) p ph
$1 = (const Elf64_Phdr *) 0x10000040

(gdb) p *ph
Cannot access memory at address 0x10000040

(gdb) info proc map
process 1110670
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
          0x10000000         0x10010000    0x10000        0x0 /mnt/testarea/ltp/testcases/bin/genasin
          0x10010000         0x10030000    0x20000        0x0 /mnt/testarea/ltp/testcases/bin/genasin
      0x7ffff7f90000     0x7ffff7fb0000    0x20000        0x0 [vdso]
      0x7ffff7fb0000     0x7ffff7fe0000    0x30000        0x0 /usr/lib64/ld-2.30.so
      0x7ffff7fe0000     0x7ffff8000000    0x20000    0x20000 /usr/lib64/ld-2.30.so
      0x7ffffffd0000     0x800000000000    0x30000        0x0 [stack]

(gdb) x/1x 0x10000040
0x10000040:     Cannot access memory at address 0x10000040
---------- >8 ----------

When this happens the binary continues to hit SIGBUS until the page
is released, for example by: echo 3 > /proc/sys/vm/drop_caches

The issue goes back to at least v4.19.

I can semi-reliably reproduce it, with LTP installed to
/mnt/testarea/ltp, by running:

while true; do
        echo 3 > /proc/sys/vm/drop_caches
        rm -f /mnt/testarea/ltp/results/RUNTEST.log /mnt/testarea/ltp/output/RUNTEST.run.log
        ./runltp -p -d results -l RUNTEST.log -o RUNTEST.run.log -f math
        grep FAIL /mnt/testarea/ltp/results/RUNTEST.log && exit 1
done

with some stress activity in another terminal (e.g. a kernel build).
It reproduces sometimes in minutes, sometimes in hours. It is not
reliable enough to get meaningful bisect results.

My theory is that there's a race in iomap. There appear to be
interleaved calls to iomap_set_range_uptodate() for same page
with varying offset and length. Each call sees bitmap as _not_
entirely "uptodate" and hence doesn't call SetPageUptodate().
Even though each bit in bitmap ends up uptodate by the time
all calls finish.

For example, with following traces:

iomap_set_range_uptodate()
...
        if (uptodate && !PageError(page))
                SetPageUptodate(page);
+
+       if (mycheck(page)) {
+               trace_printk("page: %px, iop: %px, uptodate: %d, !PageError(page): %d, flags: %lx\n", page, iop, uptodate, !PageError(page), page->flags);
+               trace_printk("first: %u, last: %u, off: %u, len: %u, i: %u\n", first, last, off, len, i);
+       }

I get:
         genacos-18471 [057] ....   162.465730: iomap_readpages: mapping: c000003f185a1ab0
         genacos-18471 [057] ....   162.465732: iomap_page_create: iomap_page_create page: c00c00000fe26180, page->private: 0000000000000000, iop: c000003fc70a19c0, flags: 3ffff800000001
         genacos-18471 [057] ....   162.465736: iomap_set_range_uptodate: page: c00c00000fe26180, iop: c000003fc70a19c0, uptodate: 0, !PageError(page): 1, flags: 3ffff800002001
         genacos-18471 [057] ....   162.465736: iomap_set_range_uptodate: first: 1, last: 14, off: 4096, len: 57344, i: 16
          <idle>-0     [060] ..s.   162.534862: iomap_set_range_uptodate: page: c00c00000fe26180, iop: c000003fc70a19c0, uptodate: 0, !PageError(page): 1, flags: 3ffff800002081
          <idle>-0     [061] ..s.   162.534862: iomap_set_range_uptodate: page: c00c00000fe26180, iop: c000003fc70a19c0, uptodate: 0, !PageError(page): 1, flags: 3ffff800002081
          <idle>-0     [060] ..s.   162.534864: iomap_set_range_uptodate: first: 0, last: 0, off: 0, len: 4096, i: 16
          <idle>-0     [061] ..s.   162.534864: iomap_set_range_uptodate: first: 15, last: 15, off: 61440, len: 4096, i: 16

This page doesn't have Uptodate flag set, which leads to filemap_fault()
returning VM_FAULT_SIGBUS:

crash> p/x ((struct page *) 0xc00c00000fe26180)->flags
$1 = 0x3ffff800002032

crash> kmem -g 0x3ffff800002032
FLAGS: 3ffff800002032
  PAGE-FLAG        BIT  VALUE
  PG_error           1  0000002
  PG_dirty           4  0000010
  PG_lru             5  0000020
  PG_private_2      13  0002000
  PG_fscache        13  0002000
  PG_savepinned      4  0000010
  PG_double_map     13  0002000

But iomap_page->uptodate in page->private suggests all bits are uptodate:

crash> p/x ((struct page *) 0xc00c00000fe26180)->private
$2 = 0xc000003fc70a19c0

crash> p/x ((struct iomap_page *) 0xc000003fc70a19c0)->uptodate
$3 = {0xffff, 0x0}

It appears (after ~4 hours) that I can avoid the problem if I split
the loop so that bits are checked only after all updates are made.
Not sure if this is the correct approach, or if it just makes the
problem harder to reproduce:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e25901ae3ff4..abe37031c93d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -131,7 +131,11 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
 			if (i >= first && i <= last)
 				set_bit(i, iop->uptodate);
-			else if (!test_bit(i, iop->uptodate))
+		}
+		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+			if (i >= first && i <= last)
+				continue;
+			if (!test_bit(i, iop->uptodate))
 				uptodate = false;
 		}
 	}

Thanks,
Jan

[1] https://lore.kernel.org/stable/1420623640.14527843.1575289859701.JavaMail.zimbra@redhat.com/T/#u

* Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Christoph Hellwig @ 2019-12-03 13:07 UTC
  To: Jan Stancek
  Cc: linux-xfs, linux-fsdevel, hch, darrick.wong, linuxppc-dev,
      Memory Management, LTP Mailing List, Linux Stable maillist,
      CKI Project, Michael Ellerman

On Tue, Dec 03, 2019 at 07:50:39AM -0500, Jan Stancek wrote:
> My theory is that there's a race in iomap. There appear to be
> interleaved calls to iomap_set_range_uptodate() for same page
> with varying offset and length. Each call sees bitmap as _not_
> entirely "uptodate" and hence doesn't call SetPageUptodate().
> Even though each bit in bitmap ends up uptodate by the time
> all calls finish.

Weird.  That should be prevented by the page lock that all callers
of iomap_set_range_uptodate hold.  But in case I miss something, does
the patch below trigger?  If not, it is not just a race, but might
be some weird ordering problem with the bitops, especially if it
only triggers on ppc, which is very weakly ordered.

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d33c7bc5ee92..25e942c71590 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -148,6 +148,8 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	unsigned int i;
 	bool uptodate = true;
 
+	WARN_ON_ONCE(!PageLocked(page));
+
 	if (iop) {
 		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
 			if (i >= first && i <= last)

* Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Jan Stancek @ 2019-12-03 14:35 UTC
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, darrick wong, linuxppc-dev,
      Memory Management, LTP Mailing List, Linux Stable maillist,
      CKI Project, Michael Ellerman

----- Original Message -----
> On Tue, Dec 03, 2019 at 07:50:39AM -0500, Jan Stancek wrote:
> > My theory is that there's a race in iomap. There appear to be
> > interleaved calls to iomap_set_range_uptodate() for same page
> > with varying offset and length. Each call sees bitmap as _not_
> > entirely "uptodate" and hence doesn't call SetPageUptodate().
> > Even though each bit in bitmap ends up uptodate by the time
> > all calls finish.
>
> Weird.  That should be prevented by the page lock that all callers
> of iomap_set_range_uptodate hold.  But in case I miss something, does
> the patch below trigger?  If not, it is not just a race, but might
> be some weird ordering problem with the bitops, especially if it
> only triggers on ppc, which is very weakly ordered.
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d33c7bc5ee92..25e942c71590 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -148,6 +148,8 @@ iomap_set_range_uptodate(struct page *page, unsigned off,
> unsigned len)
>  	unsigned int i;
>  	bool uptodate = true;
>
> +	WARN_ON_ONCE(!PageLocked(page));
> +
>  	if (iop) {
>  		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
>  			if (i >= first && i <= last)
>

Hit it pretty quick this time:

# uptime
 09:27:42 up 22 min,  2 users,  load average: 0.09, 13.38, 26.18

# /mnt/testarea/ltp/testcases/bin/genbessel
Bus error (core dumped)

# dmesg | grep -i -e warn -e call
[    0.000000] dt-cpu-ftrs: not enabling: system-call-vectored (disabled or unsupported by kernel)
[    0.000000] random: get_random_u64 called from cache_random_seq_create+0x98/0x1e0 with crng_init=0
[    0.000000] rcu: Offload RCU callbacks from CPUs: (none).
[    5.312075] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[    5.357307] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[    5.485126] megaraid_sas 0031:01:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000

So, extra WARN_ON_ONCE applied on top of v5.4-8836-g81b6b96475ac
did not trigger.

Is it possible for iomap code to submit multiple bio-s for same
locked page and then receive callbacks in parallel?

* Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Darrick J. Wong @ 2019-12-03 16:08 UTC
  To: Jan Stancek
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linuxppc-dev,
      Memory Management, LTP Mailing List, Linux Stable maillist,
      CKI Project, Michael Ellerman

On Tue, Dec 03, 2019 at 09:35:28AM -0500, Jan Stancek wrote:
>
> ----- Original Message -----
> > On Tue, Dec 03, 2019 at 07:50:39AM -0500, Jan Stancek wrote:
> > > My theory is that there's a race in iomap. There appear to be
> > > interleaved calls to iomap_set_range_uptodate() for same page
> > > with varying offset and length. Each call sees bitmap as _not_
> > > entirely "uptodate" and hence doesn't call SetPageUptodate().
> > > Even though each bit in bitmap ends up uptodate by the time
> > > all calls finish.
> >
> > Weird.  That should be prevented by the page lock that all callers
> > of iomap_set_range_uptodate hold.  But in case I miss something, does
> > the patch below trigger?  If not, it is not just a race, but might
> > be some weird ordering problem with the bitops, especially if it
> > only triggers on ppc, which is very weakly ordered.
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index d33c7bc5ee92..25e942c71590 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -148,6 +148,8 @@ iomap_set_range_uptodate(struct page *page, unsigned off,
> > unsigned len)
> >  	unsigned int i;
> >  	bool uptodate = true;
> >
> > +	WARN_ON_ONCE(!PageLocked(page));
> > +
> >  	if (iop) {
> >  		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> >  			if (i >= first && i <= last)
> >
>
> Hit it pretty quick this time:
>
> # uptime
>  09:27:42 up 22 min,  2 users,  load average: 0.09, 13.38, 26.18
>
> # /mnt/testarea/ltp/testcases/bin/genbessel
> Bus error (core dumped)
>
> # dmesg | grep -i -e warn -e call
> [    0.000000] dt-cpu-ftrs: not enabling: system-call-vectored (disabled or unsupported by kernel)
> [    0.000000] random: get_random_u64 called from cache_random_seq_create+0x98/0x1e0 with crng_init=0
> [    0.000000] rcu: Offload RCU callbacks from CPUs: (none).
> [    5.312075] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> [    5.357307] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> [    5.485126] megaraid_sas 0031:01:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
>
> So, extra WARN_ON_ONCE applied on top of v5.4-8836-g81b6b96475ac
> did not trigger.
>
> Is it possible for iomap code to submit multiple bio-s for same
> locked page and then receive callbacks in parallel?

Yes, if (say) you have 64k pages on a 4k-block filesystem and the extent
mappings for all 16 blocks aren't contiguous, then iomap will issue
separate bios for each physical fragment it finds.  iomap will call
submit_bio on those bios whenever it thinks it's done filling the bio,
so you can indeed get multiple callbacks in parallel.

--D

* Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Christoph Hellwig @ 2019-12-03 19:09 UTC
  To: Jan Stancek
  Cc: linux-xfs, linux-fsdevel, hch, darrick.wong, linuxppc-dev,
      Memory Management, LTP Mailing List, Linux Stable maillist,
      CKI Project, Michael Ellerman

Please try the patch below:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 512856a88106..340c15400423 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -28,6 +28,7 @@
 struct iomap_page {
 	atomic_t		read_count;
 	atomic_t		write_count;
+	spinlock_t		uptodate_lock;
 	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
 };
 
@@ -51,6 +52,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
 	atomic_set(&iop->read_count, 0);
 	atomic_set(&iop->write_count, 0);
+	spin_lock_init(&iop->uptodate_lock);
 	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
 
 	/*
@@ -139,25 +141,38 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 }
 
 static void
-iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 	struct inode *inode = page->mapping->host;
 	unsigned first = off >> inode->i_blkbits;
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	unsigned int i;
 	bool uptodate = true;
+	unsigned long flags;
+	unsigned int i;
 
-	if (iop) {
-		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
-			if (i >= first && i <= last)
-				set_bit(i, iop->uptodate);
-			else if (!test_bit(i, iop->uptodate))
-				uptodate = false;
-		}
+	spin_lock_irqsave(&iop->uptodate_lock, flags);
+	for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+		if (i >= first && i <= last)
+			set_bit(i, iop->uptodate);
+		else if (!test_bit(i, iop->uptodate))
+			uptodate = false;
 	}
 
-	if (uptodate && !PageError(page))
+	if (uptodate)
+		SetPageUptodate(page);
+	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+}
+
+static void
+iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+{
+	if (PageError(page))
+		return;
+
+	if (page_has_private(page))
+		iomap_iop_set_range_uptodate(page, off, len);
+	else
 		SetPageUptodate(page);
 }
 

* Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Jan Stancek @ 2019-12-04 14:43 UTC
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, darrick wong, linuxppc-dev,
      Memory Management, LTP Mailing List, Linux Stable maillist,
      CKI Project, Michael Ellerman

----- Original Message -----
> Please try the patch below:

I ran the reproducer for 18 hours on 2 systems where it previously
reproduced; there were no crashes / SIGBUS.

* [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: dftxbs3e @ 2019-12-07  0:09 UTC
  To: Jan Stancek, linux-xfs, linux-fsdevel, hch, darrick.wong,
      linuxppc-dev, Memory Management, LTP Mailing List, CKI Project,
      Michael Ellerman

Hello!

I am very happy that someone has found this issue.

I have been suffering from rather random SIGBUS errors in similar
conditions described by the author.

I don't have much troubleshooting information to provide, however, I hit
the issue regularly so I could investigate during that.

How do you debug such an issue? I tried a debugger etc. but besides
crashing with SIGBUS, I couldn't get any other meaningful information.

dftxbs3e

* Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Eric Sandeen @ 2019-12-08 20:30 UTC
  To: dftxbs3e, Jan Stancek, linux-xfs, linux-fsdevel, hch,
      darrick.wong, linuxppc-dev, Memory Management, LTP Mailing List,
      CKI Project, Michael Ellerman

On 12/6/19 6:09 PM, dftxbs3e wrote:
> Hello!
>
> I am very happy that someone has found this issue.
>
> I have been suffering from rather random SIGBUS errors in similar
> conditions described by the author.
>
> I don't have much troubleshooting information to provide, however, I hit
> the issue regularly so I could investigate during that.
>
> How do you debug such an issue? I tried a debugger etc. but besides
> crashing with SIGBUS, I couldn't get any other meaningful information.

You may want to test the patch Christoph sent on the original thread for
this issue.

-Eric

* Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later
From: Jan Stancek @ 2019-12-09  8:26 UTC
  To: dftxbs3e
  Cc: linux-xfs, linux-fsdevel, hch, Eric Sandeen, darrick wong,
      linuxppc-dev, Memory Management, LTP Mailing List, CKI Project,
      Michael Ellerman

----- Original Message -----
>
>
> On 12/6/19 6:09 PM, dftxbs3e wrote:
> > Hello!
> >
> > I am very happy that someone has found this issue.
> >
> > I have been suffering from rather random SIGBUS errors in similar
> > conditions described by the author.
> >
> > I don't have much troubleshooting information to provide, however, I hit
> > the issue regularly so I could investigate during that.
> >
> > How do you debug such an issue? I tried a debugger etc. but besides
> > crashing with SIGBUS, I couldn't get any other meaningful information.

If it's the same issue, you could check if dropping caches helps.

Figure out which page it is with crash or systemtap, and look at
page->flags and the ((struct iomap_page *)page->private)->uptodate
bitmap.

>
> You may want to test the patch Christoph sent on the original thread for
> this issue.

Or v5.5-rc1, Christoph's patch has been merged:
  1cea335d1db1 ("iomap: fix sub-page uptodate handling")
