Hey,
So at my work we have a 4-node ESXi 5.5 cluster on R710s. Each node has local drives (this one has 4x 1TB SATA in RAID 10) and dual GigE to a 3220i with 24x 500GB drives.
Yes, I know it's all old, but it's still running well, and we hope to move to a 4-node C6220 later this year.
Anyways, the short story is that the BMC or Lifecycle Controller or DRAC (whichever controls the front panel) has stopped reporting errors to VMware. It also doesn't show errors via the system light or the front panel.
I had the front bezel on, which obscured my view of the HDD lights. While troubleshooting why the ESXi hardware tab wasn't showing everything, I pulled the bezel off to try the keys on the control panel and noticed that two HDDs were in a failed state.
I'm assuming it's one failed drive from each side of the RAID 0 portion of the RAID 10, or it wouldn't be a happy situation.
Yes, we have nightly backups of the VMs on the local drives, but a restore over GigE would probably take 7-8 hours.
I've vMotioned all but two of the VMs off the local drives to the 3220i, but the two that remain are large.
The failing drives cause latency spikes of 20,000-120,000 ms roughly every 30 minutes. When a spike hits, the storage vMotion fails with the error "A general system error occurred: The source detected that the destination failed to resume."
One of the VMs I was able to move manually through SSH; that route seems to handle the latency spikes much better:
-SSH to the host.
-Stop the VM.
-Remove the VM from inventory.
-Go to the local storage directory.
-Move the VM directory to the 3220i.
-Wait forever.
-Once moved, re-add it to inventory.
-Start the VM again.
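For reference, the steps above as ESXi shell commands. This is just a sketch: the VM name, datastore names, and vmids below are placeholders, not my actual setup.

```shell
# Find the VM's ID (vmid) in the host's inventory
vim-cmd vmsvc/getallvms | grep myvm

# Power off and unregister (remove from inventory); vmid 42 is a placeholder
vim-cmd vmsvc/power.off 42
vim-cmd vmsvc/unregister 42

# Move the whole VM directory from local storage to the 3220i datastore
# (datastore names are placeholders)
mv /vmfs/volumes/local-datastore/myvm /vmfs/volumes/3220i-datastore/myvm

# Re-register from the new location (prints the new vmid), then power on
vim-cmd solo/registervm /vmfs/volumes/3220i-datastore/myvm/myvm.vmx
vim-cmd vmsvc/power.on 43
```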
That process took 45 minutes for a 60GB VM. The VMs I still need to move are 350GB and 700GB. The 350GB VM I can shut down and move using the method above; it'll take about 4 hours.
The other VM belongs to a client, and I'm guessing they don't want it shut down for 8+ hours while I move it via copy/paste.
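For anyone checking my math, those estimates just extrapolate from the observed 60GB-in-45-minutes rate (back-of-envelope only; it ignores the latency spikes):

```python
# Observed on the manual SSH/mv route: 60 GB moved in 45 minutes
rate_gb_per_min = 60 / 45  # ~1.33 GB/min

for size_gb in (350, 700):
    minutes = size_gb / rate_gb_per_min
    print(f"{size_gb} GB -> ~{minutes:.0f} min (~{minutes / 60:.1f} h)")
# 350 GB comes out around 4.4 hours, 700 GB around 8.8 hours
```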
Now you know the back story.
Is there a way to keep storage vMotion from timing out during the latency spikes? Something like the SSH method, but with the VM kept running?
Thanks in advance for all your help and your time to read my problem.