On 2020-05-25 1:40 p.m., Matthew Wilcox wrote: > On Mon, May 25, 2020 at 12:30:52PM -0400, Douglas Gilbert wrote: >> On 2020-05-25 2:57 a.m., Hannes Reinecke wrote: >>> On 5/24/20 5:58 PM, Douglas Gilbert wrote: >>>> +++ b/drivers/scsi/scsi.c >>>> @@ -554,19 +554,32 @@ EXPORT_SYMBOL(scsi_device_put); >>>>   struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost, >>>>                          struct scsi_device *prev) >>>>   { >>>> -    struct list_head *list = (prev ? &prev->siblings : &shost->__devices); >>>> -    struct scsi_device *next = NULL; >>>> -    unsigned long flags; >>>> +    struct scsi_device *next = prev; >>>> +    unsigned long flags, l_idx; >>>>       spin_lock_irqsave(shost->host_lock, flags); >>>> -    while (list->next != &shost->__devices) { >>>> -        next = list_entry(list->next, struct scsi_device, siblings); >>>> -        /* skip devices that we can't get a reference to */ >>>> -        if (!scsi_device_get(next)) >>>> -            break; >>>> +    if (xa_empty(&shost->__devices)) { >>>>           next = NULL; >>>> -        list = list->next; >>>> +        goto unlock; >>>>       } > > You don't need this stanza. xa_find() will do the right thing (ie return > NULL) with an empty array. Okay. When using xarray to store a bitset (e.g. say the 9 states in: enum scsi_device_state [include/scsi/scsi_device.h] as 2**0, 2**1, 2**2 .... 2**8), is there a mask to indicate which bit positions are used internally by xarray? Or are the full 32 bits available, other than the special value 0 aka NULL? >>>> +    do { >>>> +        if (!next) {    /* get first element iff first iteration */ >>>> +            l_idx = 0; >>>> +            next = xa_find(&shost->__devices, &l_idx, >>>> +                       scsi_xa_limit_16b.max, XA_PRESENT); > > You don't need to use a 16-bit limit here; the XArray code will stop after > the end of the array, and the only thing you might save is scanning the top > node unnecessarily far. To get there, you'd have to have an index >= 2^12, > and the additional limit would mean that you'd scan the first 16 entries of > the top node instead of all 64. Kind of pales into insignificance compared > to actually walking all 2^12 elements up to that point ;-) > If we go to using parts of the hctl tuple (host, channel, target_id, lun) as xarray indexes, then large numbers can occur in unstable systems with say the same group of targets going offline and then back online when, for example, a power supply trips out. The number of elements will still be relatively small, say well less than 2**16. >>>> @@ -391,8 +391,8 @@ static void scsi_single_lun_run(struct >>>> scsi_device *current_sdev) >>>>       spin_lock_irqsave(shost->host_lock, flags); >>>>       if (starget->starget_sdev_user) >>>>           goto out; >>>> -    list_for_each_entry_safe(sdev, tmp, &starget->devices, >>>> -            same_target_siblings) { >>>> +    /* XARRAY: was list_for_each_entry_safe(); is this safe ? */ >>>> +    xa_for_each(&starget->devices, l_idx, sdev) { >>>>           if (sdev == current_sdev) >>>>               continue; >>>>           if (scsi_device_get(sdev)) >>> >>> Frankly, I don't know why we're using the _safe variant here. >>> Typically the _safe variant is used to protect against removal >>> (ie if a list element is removed while iterating over the list), >>> but I can't really imagine why starting I/O on the device would >>> cause one entry to be deleted... >>> James? > > ... to add, xa_for_each() is safe against deletion while walking. > Or addition while walking. > >>>> @@ -234,8 +234,9 @@ static struct scsi_device >>>> *scsi_alloc_sdev(struct scsi_target *starget, >>>>       sdev->channel = starget->channel; >>>>       mutex_init(&sdev->state_mutex); >>>>       sdev->sdev_state = SDEV_CREATED; >>>> -    INIT_LIST_HEAD(&sdev->siblings); >>>> -    INIT_LIST_HEAD(&sdev->same_target_siblings); >>>> +    /* XARRAY: INIT_LIST_HEAD()s here on list leaves, why ? */ >>>> +    sdev->sh_idx = -1; >>>> +    sdev->starg_idx = -1;        /* for debugging */ >>>>       INIT_LIST_HEAD(&sdev->starved_entry); >>>>       INIT_LIST_HEAD(&sdev->event_list); >>>>       spin_lock_init(&sdev->list_lock); >>> >>> It is good practice to call INIT_LIST_HEAD() on list leaves as then you >>> can call 'list_empty()' to figure out if they are part of a list or not. >>> >>> But I'm wondering about the 'sh_idx' and the 'starg_idx' here. >>> We already have perfectly good indices with the 'lun' and 'id' entries >>> in struct scsi_device, which (incidentally) _need_ to be unique. >>> >>> So why not use them? >> >> Ah now we have something to discuss and perhaps Willy can help >> >> At the target level we have two indexes: >> - channel (is 16 bits enough) >> - target (16 or 32 bits?) >> This could be address to ways: >> 1) introduce struct scsi_channel {int num_chans; scsi_target *t1p;} >> 2) combine channnel and target into one 32 bit integer. >> >> The xarray type of its index seems to be unsigned long (Willy, >> is that correct?). That is the same size as a pointer, 32 bits >> on a 32 bit machine, 64 bits on a 64 bit machine. That is >> unfortunate, since we can represent the full 64 bit LUN as >> a single index on a 32 bit machine. We could 'hide' that and >> have two xarrays in scsi_device for 32 bit machines. > > You aren't the first person to ask about having a 64-bit lookup on > 32-bit machines. Indeed, I remember Hannes asking for a 256 bit lookup > at LinuxCon in Prague. I have always been reluctant to add such a thing because the XArray uses quite a naive data type underneath. It works well with > dense arrays but becomes very wasteful of memory for sparse arrays. > > My understanding of SCSI-world is that most devices have a single > LUN 0. Most devices that have multiple LUNs number them sequentially. > Some devices have either an internal structure or essentially pick random > LUNs for the devices they expose. Yes an no. There are now all sorts of weird patterns hiding in the 64 bit LUNs. See chapter 4.7 in sam6r05.pdf over at www.t10.org . And arguably there is a 65th bit with the LU_CONG bit in an INQUIRY standard response changing the interpretation of the 64 bit LUN. >> So to start with, I took the easy approach, the one that most >> resembles the what include/linux/list.h has been using. There >> is also a (overly ?) complex locking scheme that I really did >> not want to upset. > > I think this is currently the best way to use the XArray. I do have > a project in the works to replace the data structure implementing the > XArray API with a better one, but that project has been underway for > almost two years now, so I would have a hard time suggesting that you > wait for it to be ready. > >> Those num_chans, t1p members that I put in scsi_channel is just >> to show the xarray lookup can be bypassed when num_chans=1 >> which is very likely. Same approach could be taken in scsi_target >> to scsi_device. > > If you do only have a single entry at 0, the XArray is very efficient. > It simplifies to a single pointer. You don't need to optimise for that > case yourself ;-) That is good to know. >>>> @@ -326,6 +331,7 @@ static void scsi_target_dev_release(struct device *dev) >>>>       struct device *parent = dev->parent; >>>>       struct scsi_target *starget = to_scsi_target(dev); >>>> +    xa_destroy(&starget->devices); > > By the way, you don't have to call xa_destroy() if you know the array > is empty. The IDR used to have that unfortunate requirement, and about a > third of users blissfully ignored it, leaking uncounted amounts of memory. > >>>> @@ -445,7 +455,15 @@ static struct scsi_target >>>> *scsi_alloc_target(struct device *parent, >>>>       if (found_target) >>>>           goto found; >>>> -    list_add_tail(&starget->siblings, &shost->__targets); >>>> +    /* XARRAY: was list_add_tail(); may get hole in xarray or tail */ >>>> +    res = xa_alloc(&shost->__targets, &u_idx, starget, scsi_xa_limit_16b, >>>> +               GFP_ATOMIC); >>>> +    if (res < 0) { >>>> +        spin_unlock_irqrestore(shost->host_lock, flags); >>>> +        pr_err("%s: xa_alloc failure errno=%d\n", __func__, -res); >>>> +        return NULL; >>>> +    } >>>> +    starget->sh_idx = u_idx; >>>>       spin_unlock_irqrestore(shost->host_lock, flags); >>>>       /* allocate and add */ >>>>       transport_setup_device(dev); >>> >>> 'scsi_xa_limit_16b' ? >>> What's that for? >> >> See include/linux/xarray.h xa_limit_32b >> >> This is a 16 bit equivalent. Not sure if that is too restrictive. >> Anyway, the xa_alloc() call amongst others needs an upper and >> lower bound. > > I'm happy to have an xa_limit_16b in xarray.h. I estimate it would have > about 5-10 users across the entire kernel. No, that was just experimental. >>> And, again, if we were to use the target id as an index we could be >>> using 'xa_store' and the problem would be solved... > > It may make sense to use an allocating XArray for LUNs and targets, but > use the channel ID directly. It depends how structured these things > are in SCSI, and I'll admit I haven't been keeping up with how SCSI > numbers things on a modern system (I suspect 99.9% of SCSI devices > are USB keys with one host, one channel, one target, one LUN, but > that the 0.1% of SCSI devices on Linux systems are responsible for 80% > of the performance end of the market) Attached is a UML (like) class diagram for folks to kick around. It handles u32 values for the host, channel and target_id level and u64 at the lun (leaf) level, via a bit of hackery, even on 32 bit machines. So the object tree is built on the diagonal, with the multiplicity coming from xarray members (attributes). The arrowed lines show inheritance, with struct device being the base class aka superclass (one could argue otherwise). The "transport_class_"s are added as needed by SCSI transports such as SAS and iSCSI. And it uses native xarray indexes up to 2**32 - 1. So it meets all the constraints we have currently. Plus, based on the answers I get from above questions, in the scsi_lu_low32 class I would like to add another xarray, that holds bitsets rather than pointers. Each bitset holds amongst other things the 9 scsi_device states. That way you could iterate through all LU devices looking for an uncommon state _without_ dereferencing pointers to any sdev_s that are bypassed. And if xarray is cache efficient in storing its data, then it would be much faster. The same thing (i.e. bitset xarray) could be done in scsi_channel, for starget states. Finally, for efficient xarray use, every object needs to be able to get to its parent (or at least, its "collection") quickly as that is where all xarray operations take place. And by efficient I do _not_ mean calling dev_to_shost(starget->dev.parent) when we built the object hierarchy and _know_ where its "shost" is. Doug Gilbert