Monday, July 13, 2009

Recovering NFS datastores

I recently set up a Linux server running an NFS export that I mounted in my VMware cluster. This was essentially going to hold archive information - offline servers that did not need to be active, but still had to be kept in case they were needed. Everything was working great, and I needed to reboot the server one last time before letting it run on its own.
When I rebooted, the NFS datastores became unavailable - as expected since the server was offline. However, when the server came back online the NFS datastores still reported as unavailable. Even after waiting several hours, the datastores did not come back.

A little googling and I found several people restarting services, running esxcfg-nas -r (which is supposedly unsupported), and even rebooting their VMware hosts. The only option I could get to work was rebooting my physical host, but that only resolved connectivity on that one host. That could become a nightmare if I had to reboot every single host every time my NFS server needed a change.

I thought about removing the datastores and then re-adding them - a simple matter of a PowerShell script. However, that would leave any guest systems on those datastores unavailable and require me to re-import all of them. Then I realized that I could remove and re-add the datastores on an individual ESX host, without breaking any bindings or cluster-wide configurations.

Resolution:
To resolve this, we need to remove and re-add the datastore on the individual ESX server. This needs to be done on each ESX server that is reporting the issue.
  1. Connect via SSH to the ESX host
  2. List the NFS datastores by running esxcfg-nas -l
  3. Remove the offending datastore by running esxcfg-nas -d datastorename
  4. Re-add the datastore by running esxcfg-nas -a -o hostname -s sharename datastorename
  5. Wait a few minutes for the VC to be updated with the datastore status
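
Steps 3 and 4 can be wrapped in a small shell function so the same pair of commands is easy to repeat on each affected host. This is only a rough sketch: the function name remount_nfs and the DRY_RUN switch are my own invention, not part of any VMware tooling, and it uses nothing beyond the esxcfg-nas options listed above.

```shell
#!/bin/sh
# remount_nfs: hypothetical helper (not a VMware tool) that removes and
# re-adds one NFS datastore on the local ESX host using esxcfg-nas.
# Set DRY_RUN=1 to print the commands instead of running them.

remount_nfs() {
    name=$1    # datastore name as shown by esxcfg-nas -l
    host=$2    # NFS server hostname
    share=$3   # exported share on the NFS server

    # All three values are required to re-add the datastore.
    if [ -z "$name" ] || [ -z "$host" ] || [ -z "$share" ]; then
        echo "usage: remount_nfs datastorename hostname sharename" >&2
        return 1
    fi

    # In dry-run mode, prefix each command with echo so it is only printed.
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    # Step 3: remove the stale datastore; stop here if that fails.
    $run esxcfg-nas -d "$name" || return 1

    # Step 4: re-add it with the same name and the same target.
    $run esxcfg-nas -a -o "$host" -s "$share" "$name"
}
```

After sourcing the function in an SSH session on the host, a call such as DRY_RUN=1 remount_nfs archive01 nfs01 /export/archive (made-up names) prints the two commands so they can be checked against the esxcfg-nas -l output before running it for real without DRY_RUN.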

Why this works:
It appears that the individual servers check datastore connectivity on a routine cycle, then report status to the VC server. By removing and re-adding a datastore at the host level, the VC still sees it as the same datastore (it does have the same name and same target). As such, the next time the ESX host polls the datastores, it reports it as active and the VC doesn't know any different.
This is of course all an educated guess, and I encourage anyone with more understanding of the mechanics to update the technical details.
