The Gap Snow Doesn't Fill
Snow License Manager is excellent at collecting hardware inventory. It discovers servers, tracks configurations, and maintains a detailed asset database. What it doesn't do is tell you when something changes.
A server gets a memory upgrade — no alert. A machine stops reporting inventory for two weeks — silence. Hardware parameters shift across your fleet — you find out during the next manual review, if you're lucky.
For an enterprise telecom client managing thousands of servers, this gap was a real operational risk. They needed proactive alerts when server hardware parameters changed or inventory data went stale. Snow had the data; it just had no mechanism to act on it.
That's the problem I set out to solve.
v1: From Zero to Automated Alerts in One Sprint
The customer request came in October 2025 with a clear scope: detect hardware changes and inventory staleness, then notify the right people automatically. I estimated 30-40 hours and started building.
The Original Architecture
I designed a three-layer system. The original plan called for a dedicated SQL Server database — separate from Snow's own database. Clean separation: independent backups, no risk of polluting Snow's data with historical monitoring tables, easier to manage and migrate. This pattern had worked before — the client previously ran Snow in Azure, where creating additional databases on the SQL Server was straightforward.
But the client had recently migrated Snow from Azure to on-premises infrastructure due to performance issues. And that changed everything.
The Weekend Pivot
Right in the middle of implementation, I discovered that the on-prem SQL Server had security policies forbidding the creation or management of new databases by the Snow DB user. I escalated. The customer confirmed: getting elevated SQL rights for our account was not an option. Not "it'll take a while" — simply not happening.
This is the harsh reality of enterprise environments. You adapt to constraints. You don't override them.
The implementation deadline was Monday. I had server access over the weekend. And I had an idea that, as an architect, I hated: redesign the entire system to operate inside the only database I had access to — Snow License Manager's own database. Store the monitoring tables right alongside Snow's inventory data, but give them their own naming convention so the customizations stay clearly namespaced and can be migrated out later if a Snow version upgrade doesn't handle them gracefully.
Under normal circumstances, I would have said no and waited another month for proper access — if it ever came. This was double the work and already outside the original project scope. But I had Claude Code. Once I properly structured all the input data — the existing SQL schemas, the stored procedure logic, the table relationships, the connection strings — restructuring the whole project came down to a few well-crafted prompts. Not instant — it still took several hours to verify, test, and implement the new architecture. But the difference between "a few hours with Claude Code" and "weeks of manual rework" is the difference between meeting a deadline and missing it.
Monday morning, the SAM team had their monitoring tool. The deadline was met.
Is the architecture ideal? No, it isn't. The monitoring tables live inside Snow's database when they should have their own. But the system works — it has worked every week since deployment, processing thousands of servers without issues. And I'm now skilled enough to rebuild it cleanly if the need ever arises.
The Architecture (Post-Pivot)
Here's what the final three-layer system looks like:
Layer 1 — SQL Server Monitoring Tables (inside Snow's database). Six tables and seven stored procedures form the detection engine, namespaced within Snow License Manager's database. The core logic compares current inventory snapshots against previous baselines, flagging any hardware parameter that changed — CPU, memory, disk configuration, network adapters. A separate staleness tracker identifies servers that haven't reported inventory within configurable thresholds.
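The snapshot-versus-baseline comparison can be sketched as follows. This is a minimal Python illustration of the idea; the real detection engine runs as SQL Server stored procedures, and the field names here (cpu_cores, memory_mb, and so on) are assumptions, not the actual schema.

```python
# Illustrative sketch of the Layer 1 change-detection logic.
# The real system implements this in SQL Server stored procedures;
# field names below are hypothetical, not the actual Snow schema.

MONITORED_FIELDS = ["cpu_cores", "memory_mb", "disk_config", "network_adapters"]

def detect_changes(baseline: dict, current: dict) -> list:
    """Compare a server's previous snapshot against the current one,
    returning one record per changed hardware parameter."""
    changes = []
    for field in MONITORED_FIELDS:
        old, new = baseline.get(field), current.get(field)
        if old != new:
            changes.append({"field": field, "old": old, "new": new})
    return changes

baseline = {"cpu_cores": 8, "memory_mb": 32768, "disk_config": "2x500GB"}
current = {"cpu_cores": 8, "memory_mb": 65536, "disk_config": "2x500GB"}
print(detect_changes(baseline, current))
# [{'field': 'memory_mb', 'old': 32768, 'new': 65536}]
```

Parameters present in neither snapshot compare equal and produce no alert, so the same routine covers servers with partial inventory data.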
Layer 2 — Python Notification Service. A Python script queries the monitoring tables, formats the results into structured email alerts, and sends them via Microsoft Exchange (using the exchangelib library). Each alert includes the server name, what changed, the previous value, and the new value.
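The alert formatting in Layer 2 amounts to rendering each change record into a readable line. A minimal sketch, assuming a record layout that is illustrative rather than taken from the real code; the actual exchangelib send call is omitted here:

```python
# Hypothetical change-record layout; the real service builds an
# exchangelib Message from this body and sends it via Exchange.

def format_alert(change: dict) -> str:
    """Render one detected change as a line for the alert email body."""
    return (f"{change['server']}: {change['field']} changed "
            f"from {change['old']} to {change['new']}")

changes = [
    {"server": "srv-db-01", "field": "memory_mb", "old": 32768, "new": 65536},
]
body = "\n".join(format_alert(c) for c in changes)
print(body)
# srv-db-01: memory_mb changed from 32768 to 65536
```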
Layer 3 — Snow License Manager Custom Fields. I added custom fields directly in Snow's UI, allowing the client to configure monitoring parameters per server — which servers to monitor, staleness thresholds, notification recipients. This meant the SAM team could adjust monitoring behavior without touching any code.
The First Surprise
During initial development, I discovered that the memory calculation filter was far too narrow. The original query scope covered only a fraction of the actual server fleet. After expanding the filter criteria to match reality, the system went from monitoring a few hundred computers to covering thousands — a roughly 30x difference. Exactly the kind of gap that manual processes never catch because nobody knows what they're missing.
v1 Goes to Production — And Immediately Teaches Lessons
The system went live in late November 2025. Within days, reality introduced itself.
Problem 1: Duplicate Notifications. Servers were appearing multiple times in the same alert. The root cause was straightforward — the detection table wasn't being cleared between runs. I fixed this with a TRUNCATE before each INSERT cycle, ensuring clean state for every monitoring run.
Problem 2: Hundreds of Servers in an Email. The first real monitoring run detected hardware changes across hundreds of servers. Embedding that many results inline in an email was unreadable. I added CSV export as an email attachment — the summary email shows the count and highlights, while the detailed data lives in a structured CSV that the SAM team can filter and sort.
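Generating the CSV attachment is a small amount of standard-library code. A sketch of the approach, with assumed column names:

```python
import csv
import io

def changes_to_csv(changes: list) -> bytes:
    """Serialize detected changes into CSV bytes for an email attachment;
    the inline email body keeps only the count and highlights."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["server", "field", "old", "new"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(changes)
    return buf.getvalue().encode("utf-8")

data = changes_to_csv([
    {"server": "srv-app-07", "field": "cpu_cores", "old": 8, "new": 16},
])
print(data.decode())
# server,field,old,new
# srv-app-07,cpu_cores,8,16
```

Attaching a file the team can filter and sort in a spreadsheet scales to hundreds of rows where an inline HTML table does not.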
Problem 3: Language. The client's SAM team operates in Russian. I translated the user guide and email templates into Russian, including field labels and alert descriptions. A small detail, but one that determines whether people actually use the system.
These weren't bugs — they were the inevitable gap between building a system and operating one. Each fix took hours, not days, because the architecture was clean enough to modify without side effects.
v2: Making the System Smarter
By mid-December 2025, the client had been running v1 for three weeks and had clear feedback: the alerts were useful but too noisy. Too many devices that were legitimately decommissioned or temporarily offline were generating alerts that the team had to manually dismiss every week.
I planned five enhancements for v2:
Device Exclusion via Snow UI
The most impactful change: I added a custom field in Snow License Manager that lets the SAM team mark specific devices as excluded from monitoring. No code changes, no database edits — just check a box in the Snow UI, and that server stops generating alerts.
Automatic Re-Inclusion
Excluded devices don't stay excluded forever. When a previously excluded server starts reporting fresh inventory data again, the system automatically clears the exclusion flag. This prevents "exclude and forget" scenarios where decommissioned servers silently return to the network.
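The exclusion and re-inclusion rules combine into one small decision. A sketch of the logic, assuming a seven-day freshness window (the actual threshold is configurable in Snow):

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(days=7)  # illustrative; configurable in the real system

def should_monitor(excluded: bool, last_inventory: datetime,
                   now: datetime) -> tuple:
    """Return (monitor, still_excluded). A previously excluded server
    that reports fresh inventory again is automatically re-included,
    preventing 'exclude and forget' scenarios."""
    if excluded and now - last_inventory <= FRESH_WINDOW:
        excluded = False  # device came back: clear the exclusion flag
    return (not excluded, excluded)

now = datetime(2025, 12, 20, tzinfo=timezone.utc)
print(should_monitor(True, now - timedelta(days=1), now))   # (True, False)
print(should_monitor(True, now - timedelta(days=30), now))  # (False, True)
```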
Active Directory Disabled OU Filtering
Servers moved to the "Disabled" organizational unit in Active Directory are filtered out automatically. No point alerting on machines the organization has already marked as inactive.
Unified Linux/Windows Monitoring Logic
v1 had separate monitoring paths for Linux and Windows servers. v2 unified them into a single logic path, reducing code duplication and ensuring both platforms get the same detection quality.
LastHeartbeatUTC Switch
Switched the staleness calculation from the last scan timestamp to LastHeartbeatUTC — a more reliable indicator of whether the Snow agent is actually communicating.
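The staleness check itself is a simple timestamp comparison; what changed in v2 is which timestamp feeds it. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_heartbeat_utc: datetime, threshold_days: int,
             now: datetime) -> bool:
    """Judge staleness by LastHeartbeatUTC rather than the last scan
    timestamp: the heartbeat proves the agent is actually communicating,
    even when a full inventory scan has not run recently."""
    return now - last_heartbeat_utc > timedelta(days=threshold_days)

now = datetime(2025, 12, 20, tzinfo=timezone.utc)
print(is_stale(now - timedelta(days=10), 7, now))  # True
print(is_stale(now - timedelta(days=3), 7, now))   # False
```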
The v2 Deployment — And an Unexpected Bug
I deployed v2 on December 25-26, 2025, working from a 1,767-line implementation plan and a 970-line production migration guide. The deployment itself went smoothly.
Then testing revealed a critical issue: 24 servers were generating false positive alerts — 16% of the total alert volume. The root cause was subtle: Active Directory duplicate records. Some servers had multiple AD entries, and the monitoring query was joining against all of them, creating phantom change detections.
I implemented a hotfix (v2.0.1) with deduplication logic that resolves AD records to unique server identities before comparison. The fix eliminated all 24 false positives.
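The deduplication idea can be sketched as follows: collapse multiple AD records to one per server identity before the comparison runs. The tiebreaker shown (keep the most recently updated record) and the field names are assumptions for illustration, not the actual hotfix code:

```python
def dedupe_ad_records(records: list) -> list:
    """Resolve duplicate AD entries to one record per server name
    (case-insensitive), keeping the most recently updated entry.
    The 'updated' field and tiebreaker rule are illustrative."""
    best = {}
    for rec in records:
        key = rec["name"].lower()
        if key not in best or rec["updated"] > best[key]["updated"]:
            best[key] = rec
    return list(best.values())

dupes = [
    {"name": "SRV-DB-01", "updated": "2025-10-01", "memory_mb": 32768},
    {"name": "srv-db-01", "updated": "2025-12-01", "memory_mb": 65536},
]
print(dedupe_ad_records(dupes))
# [{'name': 'srv-db-01', 'updated': '2025-12-01', 'memory_mb': 65536}]
```

Without this step, joining the monitoring query against both AD entries made one physical server look like two conflicting snapshots — a phantom hardware change on every run.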
The Numbers
The v2 impact was measurable:
- Weekly alerts dropped from ~150 to ~50 devices — a 65% reduction. Every remaining alert represents a genuine hardware change or staleness issue that needs attention.
- False positives reduced by 95%. The team stopped ignoring alerts because the alerts started being trustworthy.
- Self-service management. The SAM team controls monitoring entirely through Snow's UI — no developer involvement needed for day-to-day operations.
What I Learned
Ship v1 fast, iterate with real data. The v1 prototype exposed problems (duplicates, scale, language) that no amount of design could have predicted. Three weeks of production data shaped v2 better than three months of planning would have.
Give users the controls. Device exclusion via Snow's existing UI was the single most impactful feature. It turned the system from "something the developer manages" into "something the SAM team owns." The exclusion custom field gets used daily; the code hasn't been touched since deployment.
False positives destroy trust faster than missing alerts. When 16% of alerts are noise, people stop reading all of them. The AD deduplication hotfix didn't just fix 24 false positives — it restored confidence in the entire system.
Monitor the monitoring. The initial few-hundred-computer scope vs. the actual thousands-strong fleet was a reminder that the first version of any monitoring system is usually watching the wrong things, or not enough of them. Always validate coverage against reality.
The system continues to run in production, processing thousands of servers weekly with minimal maintenance. The architecture — SQL detection engine, Python notification service, Snow UI configuration — has proven stable enough that the client's SAM team operates it independently.
