From: Bart Van Assche
To: "tj@kernel.org", "axboe@kernel.dk"
CC: "linux-kernel@vger.kernel.org", "peterz@infradead.org",
    "linux-block@vger.kernel.org", "kernel-team@fb.com", "oleg@redhat.com",
    "hch@lst.de", "jianchao.w.wang@oracle.com", "osandov@fb.com"
Subject: Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme
Date: Thu, 14 Dec 2017 18:51:11 +0000
Message-ID: <1513277469.2475.43.camel@wdc.com>
References: <20171212190134.535941-1-tj@kernel.org> <20171212190134.535941-3-tj@kernel.org>
In-Reply-To: <20171212190134.535941-3-tj@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> rules. Unfortunatley, it contains quite a few holes.
         ^^^^^^^^^^^^^
         Unfortunately?

> While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
                                            ^^^^^^^^^^^^^^^
                                            synchronization?

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
>  	rq->part = NULL;
> +	seqcount_init(&rq->gstate_seq);
> +	u64_stats_init(&rq->aborted_gstate_sync);
>  }
>  EXPORT_SYMBOL(blk_rq_init);

Sorry, but the above change looks ugly to me. My understanding is that
blk_rq_init() is only used inside the block layer to initialize legacy block
layer requests, while gstate_seq and aborted_gstate_sync are only relevant for
blk-mq requests. Wouldn't it be better to avoid calling blk_rq_init() for
blk-mq requests so that the above change can be left out? The only callers of
blk_rq_init() outside the block layer core that I know of are ide_prep_sense()
and scsi_ioctl_reset(). I can help with converting the SCSI code if you want.

> +	write_seqcount_begin(&rq->gstate_seq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> +	blk_add_timer(rq);
> +	write_seqcount_end(&rq->gstate_seq);

My understanding is that both write_seqcount_begin() and write_seqcount_end()
trigger a write memory barrier. Is a seqcount really faster than a spinlock?

>
> @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> +		/*
> +		 * As nothing prevents from completion happening while
> +		 * ->aborted_gstate is set, this may lead to ignored
> +		 * completions and further spurious timeouts.
> +		 */
> +		u64_stats_update_begin(&req->aborted_gstate_sync);
> +		req->aborted_gstate = 0;
> +		u64_stats_update_end(&req->aborted_gstate_sync);

If a blk-mq request is resubmitted 2**62 times, can that result in the above
code setting aborted_gstate to the same value as gstate? Isn't that a bug?
If so, how about setting aborted_gstate in the above code to e.g.
gstate ^ (2**63)?

> @@ -228,6 +230,27 @@ struct request {
>
>  	unsigned short write_hint;
>
> +	/*
> +	 * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
> +	 * value and the upper bits the generation number which is
> +	 * monotonically incremented and used to distinguish the reuse
> +	 * instances.
> +	 *
> +	 * ->gstate_seq allows updates to ->gstate and other fields
> +	 * (currently ->deadline) during request start to be read
> +	 * atomically from the timeout path, so that it can operate on a
> +	 * coherent set of information.
> +	 */
> +	seqcount_t gstate_seq;
> +	u64 gstate;
> +
> +	/*
> +	 * ->aborted_gstate is used by the timeout to claim a specific
> +	 * recycle instance of this request.  See blk_mq_timeout_work().
> +	 */
> +	struct u64_stats_sync aborted_gstate_sync;
> +	u64 aborted_gstate;
> +
>  	unsigned long deadline;
>  	struct list_head timeout_list;

Why are gstate and aborted_gstate 64-bit variables? What makes you think that
32 bits would not be enough?

Thanks,

Bart.