Monday, December 31, 2007

Creating a SCOM custom group based on an installed service

As with most environments, I have some applications that require custom monitoring and tweaking. In this instance the application is our backup system (Tivoli Storage Manager to be specific), and due to the high I/O of the application SCOM is alerting on it daily. So here is the plan: Create a group based on the TSM Server service, override the monitors for this group to reset thresholds as appropriate.

Creating a group:

Creating the group is fairly simple, but not intuitive. The first thing to do is create an attribute to base the group on. In this instance I am looking for the existence of a registry key "HKLM\SYSTEM\CurrentControlSet\Services\TSM Server1"; I initially wanted to perform a WMI query but ran into unanswered problems so we are sticking with the registry for the time being. To create the attribute:

  1. In the SCOM console, go to the Authoring pane and select Attributes | Create new attribute
  2. Enter the name, next
  3. Discovery Method
    1. Discovery Type: Registry
    2. Target: Windows Operating System_Extended
      1. To get this, browse to the Windows Operating System, the wizard then creates the extended version because it is a sealed MP.
    3. Management Pack: Your New MP
  4. Registry Probe Configuration
    1. Key or Value type: Key
    2. Path: SYSTEM\CurrentControlSet\Services\TSM Server1
    3. Attribute Type: Check if exists
    4. Frequency: 3600 seconds
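
To sanity-check a single server by hand before waiting for discovery, the same existence test can be sketched in Python. This is only an illustration of what the registry probe looks for, not how SCOM implements it; the function names are hypothetical, and the check itself needs to run on the Windows server in question.

```python
SERVICES_ROOT = r"SYSTEM\CurrentControlSet\Services"


def service_key_path(service_name):
    """Build the HKLM-relative path of a service's registry key."""
    return SERVICES_ROOT + "\\" + service_name


def service_key_exists(service_name):
    """Return True if the service key exists under HKLM (Windows only)."""
    import winreg  # standard library, only available on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, service_key_path(service_name)):
            return True
    except FileNotFoundError:
        # The key is absent, so the service is assumed not to be installed
        return False


if __name__ == "__main__":
    import sys

    if sys.platform == "win32":
        print(service_key_exists("TSM Server1"))
```

Like the attribute itself, this only proves the key is present; it says nothing about whether the service is actually running.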

There are likely some improvements I could make here, but at this time I am only looking for the existence of the key (assuming its existence means the service is installed and running). The next step is to create a group based on this new attribute.

  1. In the SCOM console, go to the Authoring pane and select Groups | Create a new group
  2. Name your group appropriately (TSM Servers) and select your MP
    1. Ensure you select the same MP the attribute is stored in, otherwise the extended class may not be available
  3. Skip the Explicit Members screen – the goal is to have 100% dynamic membership
  4. On Dynamic Members, click Create/Edit rules
    1. Select the Windows Operating System_Extended class and click Add
    2. Property: TSM Server
    3. Operator: Equals
    4. Value: True
  5. Finish out the wizard and your new group will appear.

It may take some time for all the systems to report the attribute and join the group; allow 10-20 minutes for systems to appear.

Thursday, December 20, 2007

Using NLB for redundant print servers

Working at an ASP, limiting downtime is critical; not only could it cost us money, it could also cost us hard-earned customers. One of the critical services we provide is printing - i.e. printing paychecks, printing pick-tickets for warehouses, printing invoices, etc. It could be said that if printing goes down then the customer comes to a halt. These issues have been compounded by the fact that we are running a Citrix environment, which is extremely sensitive to print drivers and printing issues.



To work around this we have come up with several short, medium, and long-term solutions. For the short-term we are using the Microsoft Print Migrator tool (printmig.exe) to backup all print drivers and configurations on a weekly basis. This way if a bad driver or configuration is introduced into the environment we can quickly roll-back to the prior config. For the medium-term, we are developing a printer compatibility list - similar to a windows hardware compatibility list - of known good and bad printers and drivers. This list is reviewed with every new printer entering the environment to ensure the appropriate drivers are being used. These solutions ensure stability of the software on the print servers, but nothing ensures the stability of the hardware.



To improve the hardware of the print servers we have 2 general choices - use brand new high end hardware, or set up a redundant/clustered solution. Brand new high end hardware is very expensive, and while it improves the likelihood of uptime it doesn't ensure it. Clustering the print server is supported and documented by MS, but that requires the Enterprise version of Windows along with a shared SCSI disk. It's also a hot-spare scenario, essentially doubling your costs (more if you include the shared disk) without improving performance. Lastly, there is the idea of using a Network Load Balancer to balance the workload between multiple servers. This could be set up using Microsoft's NLB to share the load between the systems, with the print migration tool used to synchronize the printers between the servers.



The NLB is the configuration we are in the process of implementing now. I ran the test in the lab and the only drawback I found was that if a server has a sudden failure then any print jobs in its queue would be lost. Since jobs only stay in the queue for a short time (only until they are forwarded to the printer) these will be relatively few; all new or in-process jobs would fail over to the second NLB node. This also allows the print servers to balance loads between themselves, allowing them to be lower end systems. Below is the general configuration I tested in the lab successfully.


Using NLB for redundant print servers



  1. Setup both print servers with 2 NICs – 1 DHCP, 1 static
  2. Setup NLB using the NICs with the static addresses
  3. In the registry set the following (per http://support.microsoft.com/kb/281308/)
    • HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanServer\Parameters
    • Value name: DisableStrictNameChecking
    • Data type: REG_DWORD
    • Radix: Decimal
    • Value: 1

  4. Create a DNS Alias for the NLB VIP
  5. Install and configure printers on print server A
  6. Using the printer migration tool, backup and restore all printers from server A to server B
    a. Create a recurring schedule to perform this nightly
  7. From the clients install printers from file://dnsalias/

NLB Cluster Properties

Tab                 Setting                 Value
Cluster Parameters  IP Address
Cluster Parameters  Subnet Mask
Cluster Parameters  Full internet name      DNS alias of cluster
Cluster Parameters  Cluster operation mode  Unicast
Cluster Parameters  Allow remote control    Unchecked
Cluster IP Address  n/a                     n/a
Port Rules          From                    0
Port Rules          To                      65535
Port Rules          Protocols               Both
Port Rules          Filtering Mode          Multiple host
Port Rules          Affinity                None

Update:
Once I got the hardware and NLB set up, I tested using printmig.exe to mirror the printers from server A to server B. I created a script that runs on server A nightly via a scheduled task; the backup works but the restore to server B doesn't. A little debugging led me to %windir%\system32\spool\pm\pm.log and the error "FAILURE - Can't get printer driver directory:". Some research showed that this is a security setting that stops printmig from working remotely (from server A to server B) but not locally. To fix this, go to Start | Run and type gpedit.msc. Browse to Local Computer Policy | Computer Configuration | Administrative Templates | Printers and set "Allow Print Spooler to accept client connections" to Enabled.

Update 2:
Once I began running the print migrator tool to move from our shared printserver to this redundant configuration, I found the below error in the %windir%\system32\spool\pm\pm.log
2008:01:07 08:37:40 WARNING: Kernel Mode drivers (version 2) are blocked on the target machine. Disable Kernel Mode driver blocking and re-run Printer Migrator. Ignoring this warning (Cancel button) will result in driver installation, but because they are kernel mode drivers - a serious problem with any dependent print queue could potentially bring down the system. Selecting OK will result in restore termination.
Another Google search shows yet another group policy that needs to be set to allow version 2 drivers. NOTE: This is not generally a good idea; version 2 drivers are kernel mode and can therefore crash the entire server, not just the spooler. To resolve this issue, update the driver or perform the following steps:
  1. Open gpedit.msc
  2. Browse to Computer Configuration | Administrative Templates | Printers
  3. Select "Disallow installation of printers using kernel-mode drivers" and set it to Disabled.

Wednesday, December 19, 2007

Migrating SQL servers

In the process of our SCOM rollout we used a temporary server for our SQL databases. Finally the new hardware came in, but now we have to migrate the databases with minimal downtime. This is compounded by the fact that the new server is 64bit OS and SQL, where the original was 32bit. For posterity, this is what I did to migrate the server to new hardware:
  1. Bring up the new server with 64bit OS, named as server1a (instead of server1)
  2. Setup the drives in the standard partitioning scheme (C: OS, D: SQL install, F: SQL data, G: SQL backup, L: SQL logs, T: SQL Temp)
  3. Install SQL 2005 64bit on D:
  4. Move the temp database to T:
  5. Copy the SCOM databases from the old server to the new
  6. Shutdown the original server1
  7. Rename Windows from server1a to server1
  8. Rename SQL from server1a to server1
Moving the tempDB
Using the information from the article at http://www.databasejournal.com/features/mssql/article.php/3379901 I executed the following SQL script
use master
go
Alter database tempdb modify file (name = tempdev, filename = 't:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\tempdb.mdf')
go
Alter database tempdb modify file (name = templog, filename = 't:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\templog.ldf')
Go

I then stopped SQL and copied tempdb.mdf and templog.ldf from their default locations to the new location on T:.

Copying the databases
Stop SQL on the originating server
Use robocopy to copy the data and log files to the new server (robocopy \\server1\f$ f:\ /mir /r:0 /w:0)
In the new server, attach the databases

Renaming the server
Once the system was up and running with the databases attached, I then ran the following TSQL to rename the SQL instance to the new server
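-- Drop the old server name and register the new one as the local server
exec sp_dropserver 'server1a'
GO
exec sp_addserver 'server1', 'local'
GO
exec sp_dropremotelogin 'server1a'
GO

-- Point existing SQL Agent jobs at the new server name.
-- SERVERPROPERTY takes a property name; 'ServerName' returns the current
-- Windows server name, so this part reflects the new name only after the Windows rename.
use msdb
GO
DECLARE @srv sysname
SET @srv = CAST(SERVERPROPERTY('ServerName') AS sysname)
UPDATE sysjobs SET originating_server = @srv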

Finally I renamed the Windows server (powering off the original server first) and rebooted, then reinstalled the SCOM data warehouse and reporting. Voila, everything works!

- PS -
Looking back on the process I probably would have been better off to shrink the database files prior to the robocopy. This process took an extended amount of time and could have been sped up significantly.

Friday, December 14, 2007

Case sensitive regex in SCOM

I went to create a new group in my SCOM environment based on computer name; the servers I want are named similarly, with only the last character incremented. Into the SCOM console I go: Authoring | Groups | Create a new group. For the dynamic members query, I enter "( Object is Windows Computer AND ( Principal Name Matches regular expression wttsm.* ) AND True )". I finish the wizard and click View group members...; only one system, "wttsm2", is showing. A quick look at the computer properties shows the second machine as "WTTSM1", upper case.

Googling for case-insensitive regex turns up http://www.exampledepot.com/egs/java.util.regex/Case.html; while it's written for Java, it may still work. I change the query to "( Object is Windows Computer AND ( Principal Name Matches regular expression (?i)wttsm.* ) AND True )" and then check the members; both systems now appear.

Personally I don't know if case-sensitive regex is a good thing in the SCOM rules, but at least now I know how to work around it.
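
The effect of the (?i) prefix is easy to reproduce outside SCOM. Here is a small Python sketch; Python's re module happens to accept the same inline flag, but that says nothing certain about the regex engine SCOM uses. The third name, "wtsql1", is a made-up non-matching server added for contrast.

```python
import re

# Two server names from the post plus one hypothetical non-matching name
names = ["wttsm2", "WTTSM1", "wtsql1"]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches from the start of the name."""
    return [name for name in candidates if re.match(pattern, name)]


# Without the flag the match is case sensitive and misses the upper-case name
case_sensitive = matching(r"wttsm.*", names)

# The inline (?i) flag makes the whole pattern case insensitive
case_insensitive = matching(r"(?i)wttsm.*", names)

print(case_sensitive)    # ['wttsm2']
print(case_insensitive)  # ['wttsm2', 'WTTSM1']
```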

Thursday, December 13, 2007

SCOM Maintenance Mode

One of the main goals for implementing system management tools such as SMS (SCCM) and MOM (SCOM) is to automate routine tasks; unfortunately these tools sometimes make the task more difficult than it needs to be. For instance SCOM has a maintenance mode function which allows monitoring of a system to be suspended while planned work is being done. This alleviates false alerting and enables scheduled maintenance of a system. The only way to place a system into maintenance mode, however, is by using the SCOM console.

There are several other scripts available that allow for placing servers into maintenance mode, but they were either clunky or closed source (or both), and ultimately made the task more difficult than it should be (in my opinion). A few examples are available at :

My goal is to have something that fits the following:

  • Runs from the command line for general scripting
  • Can add and remove systems from maintenance mode
  • Easy to use so Tier 1/2 operators can use it
  • Highly portable, should run on any system in the environment - any OS or domain

The first round came up with the below C# applet; it uses .NET 2.0 and the SCOM DLLs to manage the maintenance mode settings. My hope for the next version is a C# web service that runs on the management server and is interfaced via VBScript; this specifically targets the last bullet since it removes the requirement for .NET, the SCOM console, and correct domain membership.

ZIP file containing the source and compiled EXE at http://edgoad.googlepages.com/MaintMode.zip

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Text;
using Microsoft.EnterpriseManagement;
using Microsoft.EnterpriseManagement.Administration;
using Microsoft.EnterpriseManagement.Common;
using Microsoft.EnterpriseManagement.Configuration;
using Microsoft.EnterpriseManagement.Monitoring;

// http://msdn2.microsoft.com/en-us/library/bb437532.aspx - Maintenance Mode
// http://msdn2.microsoft.com/en-us/library/bb437559.aspx - repairing agents
namespace MaintMode
{
class Program
{
static void Main(string[] args)
{
Program myProg = new Program();
if ((args.Length != 3) && (args.Length != 6))
{
myProg.DisplayHelp();
}
else
{
string sAction = args[0].ToLower();
string sManagementServer = args[1].ToLower();
string sAgentName = args[2].ToLower();
switch (sAction)
{
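case "/a":
// Adding to maintenance mode needs all six arguments; show the help text otherwise
if (args.Length != 6)
{
myProg.DisplayHelp();
break;
}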
MaintenanceModeReason eReason = (MaintenanceModeReason)Enum.ToObject(typeof(MaintenanceModeReason), Convert.ToInt32(args[3]));
string sComment = args[4];
int iMinutes = Convert.ToInt32(args[5]);
myProg.StartMM(sManagementServer, sAgentName, eReason, sComment, iMinutes);
break;
case "/r":
myProg.EndMM(sManagementServer, sAgentName);
break;
case "/q":
myProg.QueryMM(sManagementServer, sAgentName);
//myProg.EndMM(sManagementServer, sAgentName);
break;
}
}
}
public void StartMM(string managementServer, string agentName, MaintenanceModeReason eReason, string sComment, int iMinutes)
{
DateTime startTime = DateTime.UtcNow;
DateTime endTime = DateTime.UtcNow.AddMinutes(iMinutes);
ManagementGroup mg = new ManagementGroup(managementServer);
ReadOnlyCollection<MonitoringObject> monitoringObjects;
MonitoringClass computerMonitoringClass;
MonitoringObjectCriteria criteria;
computerMonitoringClass = mg.GetMonitoringClass(SystemMonitoringClass.WindowsComputer);
criteria = new MonitoringObjectCriteria(string.Format("Name like '" + agentName + "%'"), computerMonitoringClass);
monitoringObjects = mg.GetMonitoringObjects(criteria);
if (monitoringObjects.Count < 1)
{
Console.WriteLine("ComputerName '" + agentName + "' at ManagementServer '" + managementServer + "' is not found");
}
else
{
foreach (MonitoringObject monitoringObject in monitoringObjects)
{
if (!monitoringObject.InMaintenanceMode)
{
monitoringObject.ScheduleMaintenanceMode(startTime, endTime, eReason, sComment);
Console.WriteLine(monitoringObject.DisplayName + " is now in maintenance mode until " + endTime.ToLocalTime().ToShortTimeString());
}
else
{
MaintenanceWindow window = monitoringObject.GetMaintenanceWindow();
endTime = window.ScheduledEndTime;
Console.WriteLine(monitoringObject.DisplayName + " is ALREADY in maintenance mode until " + endTime.ToLocalTime().ToShortTimeString());
}
}
}
}
public void EndMM(string managementServer, string agentName)
{
ManagementGroup mg = new ManagementGroup(managementServer);
ReadOnlyCollection<MonitoringObject> monitoringObjects;
MonitoringClass computerMonitoringClass;
MonitoringObjectCriteria criteria;
computerMonitoringClass = mg.GetMonitoringClass(SystemMonitoringClass.WindowsComputer);
criteria = new MonitoringObjectCriteria(string.Format("Name like '" + agentName + "%'"), computerMonitoringClass);
monitoringObjects = mg.GetMonitoringObjects(criteria);
if (monitoringObjects.Count < 1)
{
Console.WriteLine("ComputerName '" + agentName + "' at ManagementServer '" + managementServer + "' is not found");
}
else
{
foreach (MonitoringObject monitoringObject in monitoringObjects)
{
if (monitoringObject.InMaintenanceMode)
{
monitoringObject.StopMaintenanceMode(DateTime.UtcNow);
Console.WriteLine(monitoringObject.DisplayName + " is now out of maintenance mode");
}
else
{
Console.WriteLine(monitoringObject.DisplayName + " is ALREADY out of maintenance mode");
}
}
}
}
public void QueryMM(string managementServer, string agentName)
{
ManagementGroup mg = new ManagementGroup(managementServer);
ReadOnlyCollection<MonitoringObject> monitoringObjects;
MonitoringClass computerMonitoringClass;
MonitoringObjectCriteria criteria;
computerMonitoringClass = mg.GetMonitoringClass(SystemMonitoringClass.WindowsComputer);
criteria = new MonitoringObjectCriteria(string.Format("Name like '" + agentName + "%'"), computerMonitoringClass);
monitoringObjects = mg.GetMonitoringObjects(criteria);
if (monitoringObjects.Count < 1)
{
Console.WriteLine("ComputerName '" + agentName + "' at ManagementServer '" + managementServer + "' is not found");
}
else
{
foreach (MonitoringObject monitoringObject in monitoringObjects)
{
if (monitoringObject.InMaintenanceMode)
{
MaintenanceWindow window = monitoringObject.GetMaintenanceWindow();
DateTime endTime = window.ScheduledEndTime;
string sComment = window.Comments;
MaintenanceModeReason eReason = window.Reason;
Console.WriteLine(monitoringObject.DisplayName + " is in maintenance mode until " + endTime.ToLocalTime().ToShortTimeString());
Console.WriteLine("Comments: " + sComment);
Console.WriteLine("Reason : " + eReason.ToString());
}
else
{
Console.WriteLine(monitoringObject.DisplayName + " is NOT in maintenance mode");
}
}
}
}
public AgentManagedComputer Connect(string managementServer, string computerName)
{
ManagementGroup mg = new ManagementGroup(managementServer);
// TODO: Add try/catch to ManagementGroup
ManagementGroupAdministration admin = mg.GetAdministration();
// Fully qualified name of the agent-managed computer.
string fullAgentComputerName = computerName + "%";
string query = "Name LIKE '" + fullAgentComputerName + "'";
AgentManagedComputerCriteria agentCriteria = new AgentManagedComputerCriteria(query);
ReadOnlyCollection<AgentManagedComputer> agents = admin.GetAgentManagedComputers(agentCriteria);
if (agents.Count != 1)
throw new InvalidOperationException("Error! Expected one managed computer with: " + query);
AgentManagedComputer myAgent = agents[0];
return myAgent;
}
public void DisplayHelp()
{
string usage = "MaintMode.exe";
usage += "\n==============================================";
usage += "\nAdd or remove a server from maintenance mode";
usage += "\n==============================================";
usage += "\nUsage: maintmode.exe [/a /r /q] ManagementServer ComputerName [ReasonCode] [Comment] [Time]";
usage += "\n /a: Place server into maintenance mode";
usage += "\n /r: Remove server from maintenance mode";
usage += "\n /q: Query server maintenance mode information";
usage += "\n\n ManagementServer: Name of root management server";
usage += "\n ComputerName: Name of computer to place into maintenance mode";
usage += "\n ReasonCode: List of Reasoncode values (0-14):\n (required to add a server into maintenance)";
string[] myS = Enum.GetNames(typeof(MaintenanceModeReason));
foreach (string s in myS)
usage += "\n " + (int)Enum.Parse(typeof(MaintenanceModeReason), s, true) + ":" + s;
usage += "\n Comment: Comment of reason for maintenance mode\n (required to add a server into maintenance)";
usage += "\n Time: Time in minutes for maintenance mode\n (required to add a server into maintenance)";
usage += "\n\nExample: adding a machine into maintenance mode for 60 minutes";
usage += "\n maintmode.exe /a wtmsscom1 wtrpengeg 0 \"add to maint mode\" 60";
usage += "\nExample: removing a machine from maintenance mode";
usage += "\n maintmode.exe /r wtmsscom1 wtrpengeg";
Console.WriteLine(usage);
}
}
}

SCOM 2007 SP1 - Parameters not working with custom tasks

Scenario: I wanted to create a task to allow a remote console (cmd line) to target machines. I have seen this done several times in MOM2005 using PSExec (http://www.it-jedi.net/HOW-TO/HOW-TO-Remote_Interactive_CmdPrompt.doc) so it should be a simple matter of translating this to the new formats.

I created a new Console Tasks Command line,
- Name: Remote Console
- Task target: Agent
- Application: \\wtmsscom1\scripts\psexec.exe
- Parameters: \\$Target/Property[Type= SystemLibrary6050000!System.Entity ]/DisplayName$ cmd
- Working Directory: \\wtmsscom1\scripts

When I execute the task I get the following:
----------------------------------------------------------
PsExec v1.43 - execute processes remotely
Copyright (C) 2001-2003 Mark Russinovich
www.sysinternals.com

Connecting to $Target/Property[Type=...
The network path was not found.

Couldn't access $Target/Property[Type=:
Make sure that the default admin$ share is enabled on $Target/Property[Type=.

----------------------------------------------------------

For some reason, the parameter is not being substituted by SCOM properly. I validated that there were no typos or other errors, but nothing seemed to work. As a last resort I exported the MP and viewed the XML -- bingo!

\\wtmsscom1\scripts\psexec.exe

\\$Target/Host/Property[Type=
SystemLibrary6050000!System.Entity
]/DisplayName$
cmd

\\wtmsscom1\scripts\


Somehow the parameter is being parsed into 3 separate parameters, a bit of editing and I come up with the following which imports and executes properly:

\\wtmsscom1\scripts\psexec.exe

\\$Target/Host/Property[Type="SystemLibrary6050000!System.Entity"]/DisplayName$
cmd

\\wtmsscom1\scripts\



Was I missing something here? Is this a bug in SP1?

***************************************************************
Update (12/18): it looks like the attribute is being misinterpreted by SCOM and the quotes are the cause. Below is the line I entered, followed by the line that SCOM sees, with the quotes replaced by spaces:
  • $Target/Host/Property[Type="SystemLibrary6050000!System.Entity"]/DisplayName$
  • $Target/Host/Property[Type= SystemLibrary6050000!System.Entity ]/DisplayName$
If the SCOM console misinterprets the quotes as spaces, then it makes sense that the task wouldn't run properly. Hopefully this will be fixed prior to RTM or there will be a lot of mistakes made in my custom tasks.
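
A small Python sketch shows why losing the quotes would produce exactly the three fragments seen in the exported XML. It assumes the parameter string is split on whitespace, which is a guess about SCOM's behaviour rather than something confirmed; the variable and function names are illustrative.

```python
# The parameter as typed into the console, with quotes around the type name
entered = '$Target/Host/Property[Type="SystemLibrary6050000!System.Entity"]/DisplayName$'

# What SCOM appears to store: each quote replaced by a space
seen = entered.replace('"', " ")


def split_parameters(text):
    """Split a parameter string on whitespace, as a naive parser might."""
    return text.split()


print(split_parameters(entered))  # one intact parameter
print(split_parameters(seen))     # three fragments
```

With the quotes intact the string stays a single parameter; once they become spaces it breaks into "$Target/Host/Property[Type=", "SystemLibrary6050000!System.Entity" and "]/DisplayName$", matching the pieces in the exported MP.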

Monday, December 03, 2007

File copy failures across WAN links

We have a customer that is trying to copy data from his JDEdwards enterprise server (NT SQL) across the extranet and it fails intermittently. After researching the issue, the customer reported that it has been occurring for a while and always seems to happen during times of high load (payroll processing, etc.). The file copy works from another server on our network (dev client on same subnet as enterprise server).
A quick look into the normal areas: event log - CPU, memory, disk, network - showed no bottleneck that would explain this. To make matters worse, we tried performing the same file copy from our local network and everything works fine. Several network traces and reviews were performed and while a few errors were found, there was nothing that could explain this issue.
Finally I set up a perfmon trace to run for the day on a couple hundred objects, expecting to review the logs and see what was spiking during business hours. After the trace completed, I noticed that Memory: Free System Page Table Entries was holding steady at around 3000, much lower than it should be (an unstressed system is over 150000).
A quick Google search turns up an MSDN blog (http://blogs.msdn.com/chadboyd/archive/2007/03/24/pae-and-3gb-and-awe-oh-my.aspx) discussing PTEs and how they fit into the realm of memory management. After reading in the article that free PTEs should never be below 7000, I was surprised that we didn't run into more problems with this server.
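
The two figures above are enough for a rough health check on a perfmon reading. Below is a minimal Python sketch; the thresholds come from this post and the linked article, while the three labels and the function name are my own shorthand rather than anything Microsoft defines.

```python
MINIMUM_FREE_PTES = 7000       # floor quoted from the linked article
UNSTRESSED_FREE_PTES = 150000  # typical figure for an unstressed system, per this post


def classify_free_ptes(free_ptes):
    """Label a Free System Page Table Entries reading (labels are illustrative)."""
    if free_ptes < MINIMUM_FREE_PTES:
        return "critical"
    if free_ptes < UNSTRESSED_FREE_PTES:
        return "low"
    return "healthy"


# The reading from the perfmon trace described above
print(classify_free_ptes(3000))  # critical
```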

OK, so the root cause is now identified; how do I resolve it? If this server was running SQL only I would have no problems with including the /USERVA switch, or even removing the /3GB switch, but this server also has JDEdwards installed and I don't know how it works with memory. Contacting my usual sources did little for me; they all knew how much memory JDE needed and how to configure/size it, but nobody knew how much system memory was needed or whether JDE could use AWE like SQL does.
A little more googling and I found a PPT discussing JDE on windows (http://www.microsoft-oracle.com/Assets/ppt/JDE%20on%20SQL%20Server%202005%20webcast.ppt). This doc specifically states that JDE should use the /3GB switch, which suggests it needs the extra memory space also.... If I try and tune the memory by adjusting the settings in the BOOT.INI I may be able to resolve this issue, but it could compromise the SQL and JDE performance.

Below are the solutions I have identified to resolve the issue:
  1. Remove the /3GB switch from boot.ini – Most of the JDE and SQL documentation I discovered suggested enabling this setting, removing it would cause SQL and JDE performance to degrade. Not a suggested option
  2. Adjust the memory allocation with the /USERVA switch in boot.ini – Similar to removing the /3GB switch, this can cause SQL and JDE performance to degrade. It may be possible to find an acceptable middle ground, but that may take several weeks to identify the correct setting. Not a suggested option
  3. Have JDE drop files to other system – This could address the immediate problem but not the root issue. Not a suggested option
  4. Rebuild system to 64bit OS – This would resolve the memory management issues. Suggested option
  5. Move SQL and Logic systems to different hardware – This separates SQL and JDE onto separate servers, which allows more granular control and separation of resources. It also brings the configuration in line with the new standard for NT SQL customers of this size. Suggested option