I have not been actively involved in managing our production Storage Spaces Direct (S2D) infrastructure for a few weeks now, but this weekend I was asked to help with an issue, and through it I again learned a lot of new things. I want to share some of my findings here with you.
Story leading to error…
This weekend was reserved for patching the production S2D servers (Windows Server 2016 RTM – Core) with several patches from the 2018-04 bundle (KB4093119, KB4093120 and KB4093137; update: the error is not bound to this April CU only and has been seen earlier). Prior to doing this, we also updated the firmware on our Cisco UCS servers. YES, these are the same servers I had trouble connecting to a few days ago, and I blogged about the solution here.
It all started easily, and the first (out of 6) servers was updated without a glitch. The storage jobs had no trouble completing, and all S2D volumes were up and healthy. OK, time to move on to the second server… Here, too, all updates installed without trouble.
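Between node reboots, the rebuild/resync progress can be watched with the storage job cmdlets before moving on. A minimal sketch of the kind of check we relied on (not our exact script):

```powershell
# Wait until all repair/resync jobs have finished (run on any cluster node).
while (Get-StorageJob | Where-Object JobState -NE 'Completed') {
    Get-StorageJob |
        Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal |
        Format-Table -AutoSize
    Start-Sleep -Seconds 30
}

# Confirm every volume is healthy before patching the next node.
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus | Format-Table
```

Only when `Get-StorageJob` returns nothing (or only completed jobs) and all virtual disks report Healthy is it safe to take the next node down.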
Trouble with Physical drives…
After patching completed, we were faced with a surprise. Running Get-PhysicalDisk on the S2D node revealed that not all disk drives were OK. Since I wanted to know how many drives were affected, and on which nodes, I ran this:
$Enclo = Get-StorageEnclosure
foreach ($item in $Enclo) {
    $diski = $item | Get-PhysicalDisk |
        Where-Object OperationalStatus -NotLike 'OK' |
        Select-Object FriendlyName, SerialNumber, SlotNumber, FirmwareVersion, MediaType, OperationalStatus, OperationalDetails, HealthStatus, Usage, Size
    $diski | Format-Table
}
A few facts about this:
- The Transient error was only on disk drives belonging to the server we had just updated,
- Not all drives on this server were in error; in fact, only 12 out of 20,
- Both the healthy and the unhealthy sets included SSDs and HDDs.
This error seemed familiar to me, so I started searching the internet. I came across an article here, in which Romain Serre describes replacing a broken hard drive in an S2D infrastructure. But two small things did not add up. When a disk is broken in S2D, there are in fact two errors, “Transient Error” and “IO Error”, and I had a hard time believing that 12 drives would break suddenly and all at the same time.
…trying to repair was in vain
I did not give up easily: we restarted the server, and then did a hard restart by pulling the power from the server, but the status did not change. I even tried to repair S2D with the PowerShell command:
Repair-ClusterS2D -Node BrokenNodeName -Verbose
…but without success.
…time to call for help.
Since we did not really know what impact this error would have on our fully loaded servers once people were back at work on Monday, and since having inconsistent patch levels on cluster nodes is not desirable, we decided to call MS support for help.
To make a long troubleshooting story short, we came to the following conclusions:
- MS support had already heard of such issues and had developed a private fix, which we could get as a Premier support customer,
- The private fix can only be applied AFTER all S2D nodes receive the 2018-04 update KB4093119,
- A public update was also in preparation and about to be published (2018/05/09 update –> the fix has now been issued as part of the 2018-05 CU and can be found here),
- The fix needs to be applied only to the nodes that show the error.
Since we got a detailed explanation of what the Transient error actually means (see the explanation further below) and were confident that we could not lose data, we proceeded with patching the remaining servers with all the patches the first two servers had already received. It turned out that none of the remaining servers was affected by the same bug. So before applying the private fix, we still had only the initial 12 disk drives in the “Transient error” state. I could confirm this with this command:
Get-PhysicalDisk |
    Where-Object OperationalStatus -NotLike 'OK' |
    Sort-Object SerialNumber |
    Format-Table FriendlyName, SerialNumber, SlotNumber, FirmwareVersion, MediaType, OperationalStatus, OperationalDetails, HealthStatus, Usage, Size
As a last step, we applied the private fix, which remediated the “Transient error” on the drives of that single host.
What does the “Transient error” described in this article actually mean?
The explanation from MS support:
“Transient error on a drive usually means that the cache binding was dropped, and so reads to certain regions of the capacity device will be failed by cluster since cluster no longer knows where the data was being cached on the cache devices. ”
Some additional facts I was able to gather:
- The error only tells us that caching for the marked drives is not working as it should (be careful that there is not also an IO error, which would point to broken hardware),
- This state causes NO data loss,
- Since the cache-disk binding is broken, there can be a considerable performance impact on the whole S2D installation (I cannot tell you exactly what impact, since we did not want to wait through a working day without applying the fix),
- There is no rule that this bug will manifest itself when applying the 2018-04 updates; in our case only 1 out of 6 S2D nodes was affected.
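Since the Transient error comes down to a lost cache binding, one quick sanity check is to look at how the pool's physical disks are distributed by their Usage role. This is a hedged sketch, assuming the default S2D pool name pattern (“S2D on <clustername>”); adjust the filter to your pool's actual name:

```powershell
# In an S2D pool, cache devices report Usage = 'Journal' and capacity
# devices typically report 'Auto-Select'. Grouping by Usage and MediaType
# gives a quick overview and helps spot anomalies after patching.
Get-StoragePool -FriendlyName 'S2D*' | Get-PhysicalDisk |
    Group-Object Usage, MediaType |
    Select-Object Count, Name |
    Format-Table -AutoSize
```

The counts should match what you expect from your hardware layout (in our case, cache SSDs plus capacity SSDs and HDDs per node); the grouping itself does not repair anything, it only confirms the roles are still assigned.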
After a somewhat longer weekend, our whole S2D infrastructure is patched, oiled, healthy, and ready for the working week. And, most importantly, we have all learned a lot of new things.
So, today it is even more important for me to say…