Thursday, February 17, 2011

Rerun failed Avamar VMDK backup jobs

With our recent use of VMDK backups via Avamar, I have been annoyed at having to deal with at least one failure every night. Resolving a failure normally means logging into the Avamar console, finding the affected machine(s), and relaunching the backup, and that assumes the system isn't production-critical and can take a backup during the day. Combined with the relatively short retention cycle we have defined (two weeks), this means we want to make sure every backup is kept current and viable.
A little research turned up a command line client for Avamar that runs on Linux. The MCCLI Programmer Guide documents simple command lines that launch on-demand backups. That gives me a semi-automated way to rerun failed backups; I just need to identify which backups have failed.
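For a sense of what an on-demand backup call looks like, here is a minimal sketch that wraps the same `mccli client backup-group-dataset` invocation the SQL query later in this post generates. The mccli path and option names are copied from that query (the path is specific to the 5.0.3-29 install and will differ on other versions). The function name `rerun_backup`, the `DRY_RUN` switch, and the `VC_NAME` variable are hypothetical, and `<VC Name>` is still a placeholder to fill in.

```shell
# Hypothetical helper: relaunch one VMDK backup on demand via mccli.
# Path and options mirror the command built by the queryCMD SQL statement.
MCCLI=/usr/local/avamar/5.0.3-29/bin/mccli
VC_NAME="<VC Name>"   # placeholder: your vCenter name as Avamar knows it

# Usage: rerun_backup "<group name>" "<client name>"
# With DRY_RUN=1 the command is printed instead of executed.
rerun_backup() {
    group=$1
    client=$2
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s client backup-group-dataset --domain=%s/VirtualMachines --group-domain=%s --group-name="%s" --name="%s"\n' \
            "$MCCLI" "$VC_NAME" "$VC_NAME" "$group" "$client"
    else
        "$MCCLI" client backup-group-dataset \
            --domain="$VC_NAME/VirtualMachines" \
            --group-domain="$VC_NAME" \
            --group-name="$group" \
            --name="$client"
    fi
}
```

For example, `DRY_RUN=1 rerun_backup "Tier 5 VM Group" vm01` prints the full mccli command line so it can be checked before running it for real.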
The Avamar grid stores activity in a PostgreSQL database. I am not sure it is meant to be used by admins, but it is fairly well laid out and navigable with pgAdmin. With a little trial and error, I was able to craft a SQL statement that reports every failed VMDK backup and generates the command to relaunch it.

Below are the script and query I used to automate this. The script runs the query and writes the results to a file that serves as the rerun script. Cron launches the queryDB script and, a few minutes later, the rerun script.

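#!/bin/sh
# queryDB: write one mccli command per failed VMDK backup to the rerun file
export PGHOST=<FQDN of grid here>
export PGPORT=5555
export PGDATABASE=mcdb
export PGUSER=viewuser
export PGPASSWORD=viewuser1
# -t: rows only (no header/footer), -A: unaligned (no padding), -f: SQL file to run, -o: output file
psql -tAf /<Path to SQL command>/queryCMD -o /<Path to rerun command>/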
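Two cron entries a few minutes apart work as long as the query always finishes within that gap. A minimal alternative sketch runs both steps in order from a single cron entry, stops if the query step fails, and skips the rerun when the query found no failed backups. It assumes the queryDB script and the rerun file paths are passed as arguments; the wrapper name `run_nightly`, the crontab time, and all paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run the query step, then the generated rerun file,
# in sequence instead of relying on a fixed delay between two cron jobs.
# Example crontab entry (time and paths are assumptions):
#   0 7 * * * /path/to/nightly_rerun.sh /path/to/queryDB /path/to/rerun_file

run_nightly() {
    query_script=$1
    rerun_file=$2

    # Step 1: regenerate the list of mccli commands; stop if the query fails.
    sh "$query_script" || { echo "query step failed" >&2; return 1; }

    # Step 2: only run the rerun file if it contains a non-blank line.
    if grep -q '[^[:space:]]' "$rerun_file" 2>/dev/null; then
        sh "$rerun_file"
    else
        echo "no failed backups to rerun"
    fi
}

# Run only when invoked with both paths (e.g. from cron).
if [ $# -eq 2 ]; then
    run_nightly "$1" "$2"
fi
```

The trade-off is that one job now owns both steps, so a slow query delays the reruns instead of racing them.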

PostgreSQL statement, queryCMD

select distinct '/usr/local/avamar/5.0.3-29/bin/mccli client backup-group-dataset '
    || '--domain=<VC Name>/VirtualMachines '
    || '--group-domain=<VC Name> '
    -- quote both names so groups and VMs with spaces survive the shell
    || '--group-name="' || group_name || '" --name="' || display_name || '"'
from v_activities_2
-- keep only each client's most recent Tier 5 VM activity from the last 6 days
where (display_name, recorded_date_time) in
    (select b.display_name, max(b.recorded_date_time)
     from v_activities_2 b
     where b.group_name like 'Tier 5 VM%'
       AND b.recorded_date_time > CURRENT_TIMESTAMP - interval '6 day'
     group by b.display_name)
-- ...and report it only if that latest activity did not succeed
AND status_code_summary <> 'Activity completed successfully.';
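Running the generated file as one script works, but it leaves little record of which relaunches succeeded. A sketch of a per-line runner follows, assuming the file holds one mccli command per line as produced by the query above; the function name `run_rerun_file` and its log format are hypothetical. `read` trims leading and trailing whitespace (so any padding from psql's aligned output is harmless), blank lines are skipped, and each command gets stdin from /dev/null so a command that reads stdin cannot swallow the remaining lines.

```shell
# Hypothetical runner: execute each generated mccli command separately,
# logging the outcome so failed relaunches are easy to spot.
# Usage: run_rerun_file /path/to/rerun_file
run_rerun_file() {
    rerun_file=$1
    failures=0
    # Default IFS: read strips leading/trailing blanks; the || test keeps a last line lacking a newline.
    while read -r cmd || [ -n "$cmd" ]; do
        if [ -z "$cmd" ]; then
            continue            # skip blank lines
        fi
        if sh -c "$cmd" </dev/null; then
            printf 'OK: %s\n' "$cmd"
        else
            printf 'FAILED: %s\n' "$cmd" >&2
            failures=$((failures + 1))
        fi
    done < "$rerun_file"
    printf '%s command(s) failed\n' "$failures"
    # Return non-zero if any relaunch failed, so cron mail or a caller can react.
    [ "$failures" -eq 0 ]
}
```

The non-zero return status on any failure makes it easy to alert on a second consecutive failure for the same client.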