On Sun, Nov 20 2016, Konstantin Khlebnikov wrote:

> On 07.11.2016 23:34, Konstantin Khlebnikov wrote:
>> On Mon, Nov 7, 2016 at 10:46 PM, Shaohua Li wrote:
>>> On Sat, Nov 05, 2016 at 01:48:45PM +0300, Konstantin Khlebnikov wrote:
>>>> return_io() resolves the request_queue even if the trace point isn't active:
>>>>
>>>> static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>>>> {
>>>> 	return bdev->bd_disk->queue;	/* this is never NULL */
>>>> }
>>>>
>>>> static void return_io(struct bio_list *return_bi)
>>>> {
>>>> 	struct bio *bi;
>>>>
>>>> 	while ((bi = bio_list_pop(return_bi)) != NULL) {
>>>> 		bi->bi_iter.bi_size = 0;
>>>> 		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
>>>> 					 bi, 0);
>>>> 		bio_endio(bi);
>>>> 	}
>>>> }
>>>
>>> I can't see how this could happen. What kind of tests/environment are you
>>> running?
>>
>> That was a random piece of production somewhere. Judging by the timestamps,
>> all the crashes happened soon after reboot. There are several RAIDs, and
>> probably some of them were still under resync.
>>
>> For now we have only a few machines with this kernel, but I'm sure
>> I'll get many more soon =)
>
> I've added this debug patch to catch overflow of the active-stripes
> counter in the bio:
>
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -164,6 +164,7 @@ static inline void raid5_inc_bi_active_stripes(struct bio *bio)
>  {
>  	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
>  	atomic_inc(segments);
> +	BUG_ON(!(atomic_read(segments) & 0xffff));
>  }
>
> And it fired. The counter in %edx = 0x00010000.
>
> So it looks like one bio (a discard?) can cover more than 65535 stripes.

65535 stripes is 256M. I guess that is possible.

Christoph has suggested that now would be a good time to stop using
bi_phys_segments like this. I have some patches which should fix this.
I'll post them shortly. I'd appreciate it if you would test and confirm
that they work (and don't break anything else).

Thanks,
NeilBrown