Everything else had been left at default settings, and all devices were
installed on the C:\ drive. Attempts to solve log space errors had resulted
in the log devices being expanded to the point that there was only about
10M of free space left on the 2G C:\ drive. The mill setup server had a
larger disk, but all of its devices were also on a single drive (at least
it was the D:\ drive, not the operating system drive). Very little custom
configuration seemed to have been done on its SQL Server installation, either.
Operating management wasn't about to shut down a mill making several thousand
tons of steel each week just to let us experiment with the server configurations.
Thus, we made what improvements we could on the fly, and planned for a bigger
attack on the problem during the next scheduled maintenance outage. Using
device mirroring and unmirroring on the production server, we were able to shift
some of the database device files to the D:\ drive without interrupting
operations.
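In rough outline, the relocation trick looks something like this under SQL
Server 6.5 -- mirror the device onto a file on the new drive, then drop the
original side of the mirror. The logical device name and file paths below are
made up purely for illustration:

    -- Mirror the device onto a new file on the D:\ drive
    DISK MIRROR
        NAME = 'orderdata',
        MIRROR = 'D:\MSSQL\DATA\ORDERDATA.MIR'
    GO

    -- Once the mirror is established, retire the original C:\ copy;
    -- the D:\ copy carries on as the active device
    DISK UNMIRROR
        NAME = 'orderdata',
        SIDE = PRIMARY,
        MODE = REMOVE
    GO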
The production server had 96M of RAM, of which 80M was assigned to SQL Server.
By default, 30 percent of SQL Server's memory goes to the stored procedure
cache. Performance monitoring showed that at most about 15 percent of the
procedure cache was ever in use, so the configuration was changed accordingly,
freeing roughly 18M more RAM for the data cache. The mill setup server had 64M
of RAM, with only 16M assigned to SQL Server. PerfMon also showed that despite
the (usually generous) default 30 percent allocation to the stored procedure
cache, that cache was running 100 percent full. More memory was ordered for
that server, and completely new memory allocations were calculated for the new
installation there. Until the new memory arrived, SQL Server's share of RAM
was reconfigured to 40M, and the procedure cache was adjusted so that its
absolute size (not just its percentage) increased appropriately.
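For those keeping score, these changes amount to a handful of sp_configure
calls. The numbers below are only illustrative (SQL Server 6.5 takes the
memory setting in 2K units and the procedure cache setting as a percentage),
and the memory change doesn't take effect until the MSSQLServer service
restarts:

    -- Production server: keep SQL Server at 80M but shrink the
    -- procedure cache below its 30 percent default
    EXEC sp_configure 'memory', 40960          -- 80M in 2K units
    EXEC sp_configure 'procedure cache', 20    -- percentage; illustrative value
    RECONFIGURE
    GO

    -- Mill setup server: raise SQL Server's share from 16M to 40M
    EXEC sp_configure 'memory', 20480          -- 40M in 2K units
    RECONFIGURE
    GO

    -- DBCC PROCCACHE reports how much of the procedure cache is actually in use
    DBCC PROCCACHE
    GO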
At our first extensive (>24 hour) opportunity to work with the computers
off-line, the SQL Server database on the production server was blown away
(after transferring relevant data to a temporary area) and reinstalled from
scratch. We created data and log devices using sizes that were calculated
to easily hold more than one year's worth of production data -- namely,
75M for the data device and 35M for the log device. This was a far cry from
the more than 1700M to which the previous installation had expanded. RAM
configurations were redefined in a way we hoped would optimize SQL Server's
performance.
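Building devices and a database at those sizes comes down to something like
the following. The device names, paths, and virtual device numbers are
illustrative; DISK INIT takes sizes in 2K pages, while CREATE DATABASE takes
them in megabytes:

    -- 75M data device (75 x 512 = 38,400 2K pages)
    DISK INIT
        NAME = 'proddata',
        PHYSNAME = 'D:\MSSQL\DATA\PRODDATA.DAT',
        VDEVNO = 10,
        SIZE = 38400
    GO

    -- 35M log device (35 x 512 = 17,920 2K pages)
    DISK INIT
        NAME = 'prodlog',
        PHYSNAME = 'D:\MSSQL\DATA\PRODLOG.DAT',
        VDEVNO = 11,
        SIZE = 17920
    GO

    -- Put the data on its own device and the log on its own device
    CREATE DATABASE production
        ON proddata = 75
        LOG ON prodlog = 35
    GO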
The distribution database was recreated with a larger log (Microsoft
recommends a minimum of 15M, the original installation used 10M, and we now
run above the minimum). We used the database maintenance wizard to create
scheduled maintenance tasks -- something that hadn't been done before, and
that might not have been advisable anyway, given Microsoft's recommendation
against running such tasks on databases larger than 400M. The mill runs
full-time, so there's no advantage to scheduling tasks around midnight or on
Sundays. Instead, we staggered the background tasks across the day shift,
which lets us monitor performance and check for trouble more quickly and easily.
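The wizard schedules its work through SQL Executive, but the tasks themselves
amount to ordinary integrity checks, statistics updates, and dumps along these
lines. The database, table, and dump device names here are illustrative:

    USE production
    GO
    DBCC CHECKDB (production)          -- verify table and index structures
    DBCC NEWALLOC (production)         -- verify page allocation
    DBCC CHECKCATALOG (production)     -- verify system table consistency
    GO
    UPDATE STATISTICS orders           -- refresh optimizer statistics (per table)
    GO
    DUMP DATABASE production TO proddump WITH INIT   -- full dump to a dump device
    GO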
We immediately noticed better performance. Client queries that previously had
noticeable lag times now complete almost instantaneously. The replicated order
data arrives from the mainframe irregularly and in large chunks, so although
the overall load is small, things can get hectic at times. Mysterious
replication task failures have all but disappeared,
though log reader failures still happen occasionally on the mill setup server,
and whenever we think we have every data audit finally covered, someone
finds a way to get us with an "off by one" error. We are still
tweaking various configuration parameters and scheduled tasks, and we're
monitoring things constantly (at least on the day shift) in our search for
the perfect setup.
What I hope this shows is that with sufficient planning and work, installations
that don't follow perfect textbook examples can be made to function in the
real world. And conversely, that installations done with little or no planning
probably won't work. However, you shouldn't expect much help from conventional
sources (read: Microsoft support) when developing and troubleshooting such
installations, because you're likely to get the textbook answer that "you
shouldn't be doing it that way." For now, though, we've increased
the reliability of our applications, which pleases the operating folks greatly
and, apart from saving the corporation money, also pleases us personally,
since it directly translates into fewer of those unappreciated 2:00 a.m.
telephone calls. And while it might be a stretch to claim that improving
two-way replication's performance helps retain our employees, in these times
of low unemployment, anything that improves morale and decreases midnight
grumbling can't hurt.
David Peregrine Clements, a Maine
native, is now living in self-described "exile" in Mountain Brook,
AL (near Birmingham). He has a B.S. in physics and astronomy from Bowdoin
College in Maine and M.S. and Ph.D. degrees, also in physics and astronomy,
from the University of Delaware. He's currently a process control engineer
at USS and is responsible for maintaining computer models and databases
for the sheet products division of USS. He spent two years as a research
technician with the U.S. Antarctic Research Program -- one year at the South
Pole and one at McMurdo Station. www.concentric.net/~clements.