On Sun, Nov 20 2016, Konstantin Khlebnikov wrote:

> On 07.11.2016 23:34, Konstantin Khlebnikov wrote:
>> On Mon, Nov 7, 2016 at 10:46 PM, Shaohua Li wrote:
>>> On Sat, Nov 05, 2016 at 01:48:45PM +0300, Konstantin Khlebnikov wrote:
>>>> return_io() resolves the request_queue even if the trace point isn't active:
>>>>
>>>> static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>>>> {
>>>> 	return bdev->bd_disk->queue;	/* this is never NULL */
>>>> }
>>>>
>>>> static void return_io(struct bio_list *return_bi)
>>>> {
>>>> 	struct bio *bi;
>>>>
>>>> 	while ((bi = bio_list_pop(return_bi)) != NULL) {
>>>> 		bi->bi_iter.bi_size = 0;
>>>> 		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
>>>> 					 bi, 0);
>>>> 		bio_endio(bi);
>>>> 	}
>>>> }
>>>
>>> I can't see how this could happen. What kind of tests/environment are you
>>> running?
>>
>> That was a random piece of production somewhere. Judging by the timestamps,
>> all the crashes happened soon after reboot. There are several RAIDs, and
>> probably some of them were still under resync.
>>
>> For now we have only a few machines with this kernel, but I'm sure
>> I'll get many more soon =)
>
> I've added this debug patch to catch overflow of the active-stripes
> counter in the bio:
>
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -164,6 +164,7 @@ static inline void raid5_inc_bi_active_stripes(struct bio *bio)
>  {
>  	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
>  	atomic_inc(segments);
> +	BUG_ON(!(atomic_read(segments) & 0xffff));
>  }
>
> And it fired. The counter in %edx = 0x00010000.
>
> So it looks like one bio (a discard?) can cover more than 65535 stripes.

65535 stripes is 256M. I guess that is possible.

Christoph has suggested that now would be a good time to stop using
bi_phys_segments like this. I have some patches which should fix this.
I'll post them shortly. I'd appreciate it if you would test and confirm
that they work (and don't break anything else).

Thanks,
NeilBrown