Friday, July 22, 2011

Using tape libraries under ESX 4.1

In the old days of ESX 3.5, there was an unsupported feature that allowed SCSI pass-through to VMs. This allowed us to build a VM and connect our FC tape library to perform backups and restores using various backup software.
This wasn't the highest-performance backup/recovery solution, but it was definitely the most flexible. It allowed us to swap backup solutions (BackupExec, NetBackup, Networker, CommVault, etc.) by building out separate VMs for each configuration. By only having 1 VM powered on at any time, you could also share the same library across all the VMs.

This unsupported feature broke entirely with ESX 4.0. SCSI pass-through would allow the VM to see the device, but it couldn't use the library. VMware had some experimental DirectPath features, but all our research suggested our configuration wouldn't work. To deal with the shortcomings, we kept 2 ESX hosts running at 3.5 to act as our Tape-Cluster, while we upgraded the rest to 4.0.

This extremely inefficient setup finally called for some changes, so I started looking for iSCSI solutions. In theory, the tape library can be presented via iSCSI to the OS running in the VM. Then, regardless of hardware (physical or virtual), the OS would be able to use the tape library.
There were a few hacked solutions out there, but only 1 came in an enterprise-ready setup. StarWind Software allows for publishing of physical and virtual devices (Hard Drive, Optical Drive, Tape Device) via iSCSI. Out of the box, the eval key allowed me to present my library and drives via iSCSI without any customization. Once the Microsoft iSCSI Initiator was installed and configured, the tape library was discovered without issue. Backup Exec readily accepted the changer and drives without question.

A physical server (4 proc, 32GB RAM) was built running Windows Server 2008 R2 to operate as the iSCSI gateway. A single FC card connected to the tape library via a Brocade switch. A single 10Gb Ethernet connection was used for iSCSI. Overall, an over-sized system for this test, but it was available at the time.
Our existing ESX 3.5 Backup Exec VM (the one configured for SCSI pass-through) was cloned and moved to a pristine ESX 4.1 environment. The hardware version, NICs, and VMware Tools were upgraded to the latest versions. Microsoft's iSCSI initiator was installed and connected to the gateway server.
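For anyone who would rather script the initiator side than click through the GUI, here is a rough sketch of the idea. It only builds the command lines for Microsoft's iscsicli tool (QAddTargetPortal, ListTargets, QLoginTarget) and does not run them. The portal address and target name are made-up placeholders, not the values from my gateway, and I did my own setup through the initiator GUI, so treat this as an untested starting point.

```python
from typing import List


def iscsicli_commands(portal_ip: str, target_iqn: str) -> List[List[str]]:
    """Build the iscsicli invocations to reach a target on an iSCSI gateway.

    Both arguments are supplied by the caller; nothing here is specific
    to any particular gateway product.
    """
    return [
        # Register the gateway as a target portal (default port 3260).
        ["iscsicli", "QAddTargetPortal", portal_ip],
        # List the targets the portal advertises, to confirm the name.
        ["iscsicli", "ListTargets"],
        # Quick login to the chosen target (no CHAP in this sketch).
        ["iscsicli", "QLoginTarget", target_iqn],
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for cmd in iscsicli_commands("192.0.2.10", "iqn.2011-07.example:tape-library"):
        print(" ".join(cmd))
```

Each printed line can be pasted into an elevated command prompt on the backup VM. A login made this way is not persistent across reboots, so it would still need to be marked persistent separately.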

So far I have found no deficiency in functionality. Once the iSCSI initiator connected, the library and drives were discovered without intervention, and the existing drivers supported the new devices without issue. Backup Exec was reconfigured with the new devices (same process as if a new library was connected), but otherwise there was minimal setup. The library inventories, catalogs, and restores the same as an FC-connected system.
One issue I ran into: a coworker inserted tapes into the library while I was in between tests, and somehow the library reported that it had rebooted. Backup Exec then reported the drive was offline, and nothing I did on the Backup Exec server resolved the issue. Ultimately I had to restart the StarWind service on my gateway system to get things working again. Backup Exec then allowed me to set the drive as online and continue my testing.

I performed several tests using the iSCSI gateway and timed the results. I then performed the same tests on our existing ESX3.5 VM to compare the run times. 
Initial performance testing proved encouraging. Scanning, inventorying, and cataloging the library all took about the same amount of time in the two configurations. Any variation in times could easily be attributed to the library doing other tasks at the same time. In my view, basic library/tape functions were comparable between the two setups.
Restore testing is where the solution would really be tested. I performed multiple restores ranging from a few MB up to a hundred GB. I made sure to use the same tape set and source data on both systems to ensure a fair comparison.
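To compare runs like these, the arithmetic is just restore size over elapsed time, plus the ratio of the two run times. A small sketch of that calculation follows. The function names are my own, and the numbers in the example are placeholders to show the math, not my measured results.

```python
def throughput_mb_per_s(size_mb: float, elapsed_s: float) -> float:
    """Average restore throughput in MB/s for one run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return size_mb / elapsed_s


def slowdown(baseline_s: float, candidate_s: float) -> float:
    """How many times longer the candidate run took than the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return candidate_s / baseline_s


if __name__ == "__main__":
    # Placeholder figures only: a 1024 MB restore timed on each setup.
    size_mb = 1024
    passthrough_s = 64    # hypothetical ESX 3.5 pass-through run
    iscsi_s = 160         # hypothetical iSCSI gateway run
    print(throughput_mb_per_s(size_mb, passthrough_s), "MB/s via pass-through")
    print(throughput_mb_per_s(size_mb, iscsi_s), "MB/s via iSCSI")
    print(slowdown(passthrough_s, iscsi_s), "x slower via iSCSI")
```

Using the same tape set and source data on both sides is what makes the ratio meaningful. Otherwise differences in tape positioning or file mix would swamp the comparison.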

The performance differences for larger restores were astounding. Small restores and library functions were comparable, but when transferring large amounts of data, iSCSI fell well short of the pass-through setup.
Additionally, I had a few failures on the larger restores after the 50GB mark using iSCSI. The failures were intermittent, but additional research suggests this may be a limitation of the technology.
