Linux-NVME Archive on lore.kernel.org
From: Guenter Roeck <linux@roeck-us.net>
To: Akinobu Mita <akinobu.mita@gmail.com>
Cc: linux-hwmon@vger.kernel.org, Jean Delvare <jdelvare@suse.com>,
	Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, Jens Axboe <axboe@fb.com>,
	Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH] nvme: hwmon: provide temperature min and max values for each sensor
Date: Mon, 11 Nov 2019 09:35:05 -0800
Message-ID: <20191111173505.GB5826@roeck-us.net>
In-Reply-To: <CAC5umyiju2Q2fdfVaFyX+Q=sMKr5Gsc_GDVYmSa0vB+w8acvAw@mail.gmail.com>

On Tue, Nov 12, 2019 at 12:56:21AM +0900, Akinobu Mita wrote:
> On Mon, Nov 11, 2019 at 1:30, Guenter Roeck <linux@roeck-us.net> wrote:
> >
> 
[ ... ]

> > > Example output from the "sensors" command:
> > >
> > > nvme-pci-0100
> > > Adapter: PCI adapter
> > > Composite:    +53.0 C  (low  = -273.0 C, high = +70.0 C)
> > >                         (crit = +80.0 C)
> > > Sensor 1:     +56.0 C  (low  = -273.0 C, high = +65262.0 C)
> > > Sensor 2:     +51.0 C  (low  = -273.0 C, high = +65262.0 C)
> > > Sensor 5:     +73.0 C  (low  = -273.0 C, high = +65262.0 C)
> > >
> >
> > Have you tried writing the limits ? On my Intel NVME drive (SSDPEKKW512G7), writing
> > any minimum limit on the Composite temperature sensor results in a temperature
> > warning, and that warning is sticky until I reset the controller.
> > I don't see that problem on Samsung SSD 970 EVO 500GB; I have not yet tried others.
> 
> I have Crucial CT500P1SSD8 and WDC WDS512G1X0C-00ENX0, and I have no
> problem with these devices.
> 
> > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100
> > nvme-pci-0100
> > Adapter: PCI adapter
> > Composite:    +30.0°C  (low  = -273.0°C, high = +70.0°C)
> >                         (crit = +80.0°C)
> >
> > root@jupiter:/sys/class/hwmon/hwmon0# echo 0 > temp1_min
> > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100
> > nvme-pci-0100
> > Adapter: PCI adapter
> > Composite:    +30.0°C  (low  =  +0.0°C, high = +70.0°C)  ALARM
> >                         (crit = +80.0°C)
> >
> > It doesn't seem to matter which temperature I write; writing -273000 has
> > the same result.
> >
> > [This is actually why I didn't use the features commands; not that I had observed
> >   the problem, but I was concerned that problems like this would show up.]
> 
> Maybe we should introduce a new quirk so that we can avoid changing
> temperature threshold for such devices.  Could you tell SSDPEKKW512G7's
> vendor and device ID?  Quick googling answers it's 8086:f1a5, but I want
> to make sure.

Yes, that is correct.

01:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:f1a5] (rev 03)
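
[ Editorial sketch, not part of the patch: one way the quirk idea quoted
  above could look, written as a small user-space program rather than
  kernel code. The flag name QUIRK_NO_TEMP_THRESH_CHANGE, the table, and
  both helper functions are hypothetical; only the 8086:f1a5 ID comes from
  this thread. ]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quirk bit: the device misbehaves when thresholds are written. */
#define QUIRK_NO_TEMP_THRESH_CHANGE (1UL << 0)

struct quirk_entry {
	uint16_t vendor;
	uint16_t device;
	unsigned long quirks;
};

/* Only the entry confirmed in this thread; other devices are not listed. */
static const struct quirk_entry quirk_table[] = {
	{ 0x8086, 0xf1a5, QUIRK_NO_TEMP_THRESH_CHANGE }, /* Intel SSDPEKKW512G7 */
};

/* Return the quirk bits for a vendor/device pair, or 0 if none are known. */
static unsigned long lookup_quirks(uint16_t vendor, uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
		if (quirk_table[i].vendor == vendor &&
		    quirk_table[i].device == device)
			return quirk_table[i].quirks;
	}
	return 0;
}

/* A write handler could check this first and refuse to touch the thresholds. */
static bool temp_thresh_writable(uint16_t vendor, uint16_t device)
{
	return !(lookup_quirks(vendor, device) & QUIRK_NO_TEMP_THRESH_CHANGE);
}
```

[ With such a check, the min/max attributes could be exposed read-only (or
  hidden) on affected devices while staying writable elsewhere. ]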

I'll see if I can test this tonight on my other NVMEs. I also dug up an old
NVMe drive from Toshiba; I'll see if I can connect and test it as well.

[ ... ]

> > >        */
> > >       switch (attr) {
> > >       case hwmon_temp_max:
> > > -             *val = (data->ctrl->wctemp - 273) * 1000;
> > > +             err = nvme_get_temp_thresh(data->ctrl, channel, false, val);
> > > +             if (err)
> > > +                     *val = (data->ctrl->wctemp - 273) * 1000;
> >
> > This would report WCTEMP for all sensors on errors, including errors seen while
> > the controller is resetting. I think it should be something like
> >
> >                 int err = 0;
> >                 ...
> >
> >                 if (!channel)
> >                         *val = (data->ctrl->wctemp - 273) * 1000;
> >                 else
> >                         err = nvme_get_temp_thresh(data->ctrl, channel, false, val);
> >                 return err;
> >
> > assuming we keep using ctrl->wctemp (see below). If changing the upper Composite
> > temperature sensor limit changes wctemp, and we don't update it, we should not
> > use it at all after registration and just report the error.
> >
> > >               return 0;
> > > +     case hwmon_temp_min:
> > > +             return nvme_get_temp_thresh(data->ctrl, channel, true, val);
> > >       case hwmon_temp_crit:
> > >               *val = (data->ctrl->cctemp - 273) * 1000;
> > >               return 0;
> > > @@ -73,6 +117,23 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> > >       return err;
> > >   }
> > >
> > > +static int nvme_hwmon_write(struct device *dev, enum hwmon_sensor_types type,
> > > +                         u32 attr, int channel, long val)
> > > +{
> > > +     struct nvme_hwmon_data *data = dev_get_drvdata(dev);
> > > +
> > > +     switch (attr) {
> > > +     case hwmon_temp_max:
> > > +             return nvme_set_temp_thresh(data->ctrl, channel, false, val);
> >
> > Does this change WCTEMP if written on channel 0 ? If so, we would have to update
> > the cached value of ctrl->wctemp (or never use it after registration).
> 
> At least for the devices I have, setting the over temperature threshold
> doesn't change the WCTEMP.
> I have checked with  'nvme id-ctrl /dev/nvme0 | grep ctemp'.
> 

Interesting. I just tested this, and the result is the same with Samsung
SSD 970 EVO. With that in mind, maybe we should really not use wctemp
at all after initialization, as I had suggested above. What do you think ?
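
[ Editorial sketch, not part of the patch: a user-space mock of a read
  path that never falls back to the cached wctemp and simply returns the
  error. struct mock_ctrl, mock_get_temp_thresh() and mock_hwmon_read()
  are hypothetical stand-ins for the controller, nvme_get_temp_thresh()
  and nvme_hwmon_read(); the channel count of 9 assumes Composite plus
  sensors 1 to 8. The conversion is the (kelvin - 273) * 1000 expression
  from the quoted diff. ]

```c
#include <errno.h>
#include <stdbool.h>

#define KELVIN_OFFSET 273
#define NUM_CHANNELS  9	/* assumption: Composite + sensors 1..8 */

/* Hypothetical stand-in for the controller; thresholds are kept in Kelvin. */
struct mock_ctrl {
	bool resetting;
	unsigned int over_thresh[NUM_CHANNELS];
	unsigned int under_thresh[NUM_CHANNELS];
};

enum mock_attr { MOCK_TEMP_MIN, MOCK_TEMP_MAX };

/* Same arithmetic as the diff: Kelvin to millidegrees Celsius. */
static long kelvin_to_millicelsius(unsigned int kelvin)
{
	return ((long)kelvin - KELVIN_OFFSET) * 1000;
}

/* Mock threshold query: fails while the controller is resetting. */
static int mock_get_temp_thresh(const struct mock_ctrl *ctrl, int channel,
				bool under, long *val)
{
	if (ctrl->resetting)
		return -EBUSY;
	if (channel < 0 || channel >= NUM_CHANNELS)
		return -EINVAL;

	*val = kelvin_to_millicelsius(under ? ctrl->under_thresh[channel]
					    : ctrl->over_thresh[channel]);
	return 0;
}

/* No cached fallback: an error from the query is reported to the caller. */
static int mock_hwmon_read(const struct mock_ctrl *ctrl, enum mock_attr attr,
			   int channel, long *val)
{
	switch (attr) {
	case MOCK_TEMP_MAX:
		return mock_get_temp_thresh(ctrl, channel, false, val);
	case MOCK_TEMP_MIN:
		return mock_get_temp_thresh(ctrl, channel, true, val);
	}
	return -EINVAL;
}
```

[ The same arithmetic accounts for the limits in the "sensors" output
  quoted earlier: a threshold of 0xFFFF K reads back as +65262.0 C and a
  threshold of 0 K as -273.0 C. ]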

Thanks,
Guenter

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


Thread overview: 16+ messages
2019-11-10 14:17 Akinobu Mita
2019-11-10 16:30 ` Guenter Roeck
2019-11-11 15:56   ` Akinobu Mita
2019-11-11 17:35     ` Guenter Roeck [this message]
2019-11-12 14:40       ` Akinobu Mita
2019-11-12 15:04         ` Guenter Roeck
2019-11-12 15:06           ` Christoph Hellwig
2019-11-12 16:35             ` Guenter Roeck
2019-11-11 16:53 ` Christoph Hellwig
2019-11-12 14:19   ` Akinobu Mita
2019-11-12 14:21     ` Christoph Hellwig
2019-11-12 15:00       ` Akinobu Mita
2019-11-12 15:08         ` Christoph Hellwig
2019-11-12 16:38         ` Guenter Roeck
2019-11-13 12:58           ` Akinobu Mita
2019-11-13 14:11             ` Guenter Roeck
