Crucial (Micron) SATA MX500 series SSDs: watch out for drive endurance

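In 2019 I bought a Crucial MX500 500GB SSD on JD.com and put it in a tower server in my rack as the Proxmox system disk; some VMs also use this drive. This year I started tinkering with k3s, again with this drive as the system disk. After I installed a pile of things on k3s, disk I/O reached 5 MB/s, which I did not pay much attention to.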

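Last month I checked the drive status in Proxmox and saw a wearout value of 84, so I figured it was fine. Today I looked again and, wow, it had jumped straight to 97. That is really rather high.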

Out of habit I went to look at the SMART data, which showed these values:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       3
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       1
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12066
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       101
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   003   003   000    Old_age   Always       -       1462
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       33
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       45
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       3
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   046   031   000    Old_age   Always       -       54 (Min/Max 0/69)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       170
202 Percent_Lifetime_Remain 0x0030   003   003   001    Old_age   Offline      -       97
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       3
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       115471620952
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       3129192176
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       20667757649

SMART Error Log Version: 1
No Errors Logged

The line to note is 202 Percent_Lifetime_Remain 0x0030 003 003 001 Old_age Offline - 97. Intuition told me this meant the drive still had 97 endurance remaining.

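I asked friends and searched online. Some said the wear was 97, others said 97 endurance remained, and there was no definitive answer. So I called Crucial's after-sales support line directly (to my surprise a young woman answered, haha) and asked. At first she also said it meant 97 endurance remaining. But I pointed out that last month I had seen 84, so how could endurance go up? She said this needed to be confirmed with their own software, then took my email address and sent me the software.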

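Since the drive sits in a server with no GUI, I downloaded msecli to check. It turned out that the drive's wear really had reached 97, with only 3% endurance left. The tool issued a warning that the drive is approaching the end of its life.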

root@pve:~# msecli -L

Device Name          : /dev/sda
Model No             : CT500MX500SSD1
Serial No            : 1911E1F2550D
FW-Rev               : M3CR023
Total Size           : 500.00GB
Drive Status         : Attention! The Drive is Approaching the end of the Specified Lifetime. Prolonged usage will invalidate the warranty
Sata Link Speed      : Gen3 (6.0 Gbps)
Sata Link Max Speed  : Gen3 (6.0 Gbps)
Temp(C)              : 54
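
The confusion above comes down to which column of the smartctl row one reads. On this drive, the normalized VALUE column (003) matched what msecli calls remaining life, while RAW_VALUE (97) matched the wear. Below is a minimal Python sketch of that reading. The function names `parse_attr_line` and `lifetime_report` are hypothetical, and the interpretation is an assumption based only on this one MX500; other vendors may encode attribute 202 differently.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized VALUE column
    worst: int
    thresh: int
    raw: str     # RAW_VALUE column, kept as text because some rows carry notes


def parse_attr_line(line: str) -> SmartAttr:
    """Parse one row of the attribute table printed by a smartctl-style tool."""
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    fields = line.split(None, 9)
    return SmartAttr(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=fields[9].strip(),
    )


def lifetime_report(attr: SmartAttr) -> dict:
    """Read attribute 202 the way msecli reported it for this drive (assumption)."""
    if attr.attr_id != 202:
        raise ValueError("expected attribute 202")
    return {
        "percent_remaining": attr.value,             # VALUE 003 -> 3 % left
        "percent_used": int(attr.raw.split()[0]),    # RAW_VALUE 97 -> 97 % worn
        "above_threshold": attr.value > attr.thresh, # still above THRESH 001
    }


row = "202 Percent_Lifetime_Remain 0x0030   003   003   001    Old_age   Offline      -       97"
report = lifetime_report(parse_attr_line(row))
print(report)
```

Reading both columns side by side makes the earlier jump from 84 to 97 consistent: wear goes up while the remaining percentage goes down.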

The SMART data shown by msecli also differs from what third-party SMART tools show.

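In the third-party SMART output, the RAW value of Percent_Lifetime_Remain is 97, which is easily read as 97 health remaining, even though the normalized VALUE column on the same line already reads 003: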

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       3
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       1
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12066
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       101
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   003   003   000    Old_age   Always       -       1462
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       33
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       45
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       3
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   046   031   000    Old_age   Always       -       54 (Min/Max 0/69)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       170
202 Percent_Lifetime_Remain 0x0030   003   003   001    Old_age   Offline      -       97
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       3
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       115471620952
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       3129192176
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       20667757649

SMART Error Log Version: 1
No Errors Logged

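In the official tool msecli, however, the SMART value of Percentage Lifetime Remaining is only 3. The drive has nearly used up its rated endurance and, I feared, could well die at the next power-on.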

root@pve:~# msecli -L

Device Name  : /dev/sda
 ID  Attribute Name                Attribute Data  Units
 1   Raw Read Error Rate           3               Errors/Page
 5   Reallocated NAND Block Count  1               NAND Blocks
 9   Power On Hours Count          12066           Hours
 12  Power Cycle Count             101             Power Cycles
 171 Program Fail Count            0               NAND Page Program Failures
 172 Erase Fail Count              0               NAND Block Erase Failures
 173 Block Wear-Leveling Count     1462            Erases
 174 Unexpected Power Loss Count   33              Unexpected Power Loss events
 180 Unused Reserved Block Count   45              Blocks
 183 SATA Interface Downshift      3               Downshifts
 184 Error Correction Count        0               Correction Events
 187 Reported Uncorrectable Errors 0               ECC Correction Failures
 194 Enclosure Temperature         55              Current Temperature (C)
                                   69              Highest Lifetime Temperature (C)
 196 Reallocation Event Count      1               Events
 197 Current Pending ECC Count     0               ECC Counts
 198 SMART Off-line Scan           0               Errors
     Uncorrectable Errors
 199 Ultra-DMA CRC Error Count     170             Errors
 202 Percentage Lifetime Remaining 3               % Lifetime Remaining
 206 Write Error Rate              0               Program Fails/MB
 210 RAIN Successful Recovery      3               TUs successfully recovered by
     Page Count                                    RAIN
 246 Cumulative Host Write         115472345976    512 Byte Sectors
     Sector Count
 247 Host Program Page Count       3129465248      NAND Page
 248 FTL Program Page Count        20668114065     NAND Page
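
The counters in this table can be turned into more readable figures. The sketch below converts attribute 246 to terabytes using the 512-byte sector unit that msecli prints, and averages it over the power-on hours from attribute 9. The write amplification line uses the formula commonly cited for Micron/Crucial drives, (host pages + FTL pages) / host pages; the post itself does not state it, so treat it as an assumption.

```python
SECTOR_BYTES = 512  # unit printed by msecli for attribute 246

# Readings copied from the msecli output above.
total_sectors_written = 115_472_345_976  # attribute 246
host_pages = 3_129_465_248               # attribute 247
ftl_pages = 20_668_114_065               # attribute 248
power_on_hours = 12_066                  # attribute 9

# Total data the host has written, in decimal terabytes.
host_bytes = total_sectors_written * SECTOR_BYTES
host_tb = host_bytes / 1e12

# Average host write rate over the whole powered-on life, in MB/s.
avg_mb_per_s = host_bytes / (power_on_hours * 3600) / 1e6

# Assumed formula: extra NAND pages programmed by the FTL count as amplification.
write_amplification = (host_pages + ftl_pages) / host_pages

print(f"host writes: {host_tb:.1f} TB")
print(f"average write rate: {avg_mb_per_s:.2f} MB/s")
print(f"write amplification (assumed formula): {write_amplification:.1f}")
```

With these readings the sketch gives roughly 59 TB of host writes, an average near 1.4 MB/s over the drive's powered-on life, and a write amplification of about 7.6. The lifetime average is lower than the 5 MB/s seen under k3s because it also covers the quieter earlier years.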

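I also replied by email to the Crucial staff, partly to report the problem I had run into and partly to thank them for their work. They answered quickly; their original reply was posted as a screenshot.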

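Let this be a reminder to everyone: check your drive data regularly, whether it is an SSD or an ordinary hard disk, consumer grade or enterprise grade. Data is priceless, so keep an eye on it. Enough said, I am off to back up my data... A new drive is on its way, and tomorrow I will restore the system onto it.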
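
Two readings taken some time apart are enough for a rough projection of how long a drive has left. The sketch below extrapolates linearly from the two wear values in this post (84 and 97). The function name `days_until_worn_out` is hypothetical, and the 30-day gap is an assumption standing in for "last month".

```python
def days_until_worn_out(used_before: float, used_now: float, days_between: float) -> float:
    """Linearly project the days left until wear reaches 100 %."""
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    rate_per_day = (used_now - used_before) / days_between
    if rate_per_day <= 0:
        return float("inf")  # no measurable wear between the two readings
    return (100.0 - used_now) / rate_per_day


# Wear readings from this post; the 30-day interval is an assumption.
remaining_days = days_until_worn_out(used_before=84, used_now=97, days_between=30)
print(f"about {remaining_days:.0f} days at the current write rate")
```

Under those assumptions the drive would reach 100% wear in about a week, which is why backing up immediately was the right call. Real wear is not strictly linear, so this is only an order-of-magnitude estimate.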

Final notes

The msecli software appears in the last image of this post; here are the download links, including the one for Linux:

For Windows: https://www.crucial.com/support/storage-executive
For Linux: https://www.micron.com/products/ssd/storage-executive-software
