Everything else had been left at default settings, and all devices were
installed on the C:\ drive. Attempts to solve log space errors had resulted
in the log devices being expanded to the point that there was only about
10M of free space left on the 2G C:\ drive. The mill setup server had a
larger disk, but all of its devices were also on a single drive (at least
it was the D:\ drive, not the operating system drive). Very little custom
configuration seemed to have been done on its SQL Server installation, either.
Operating management wasn't about to shut down a mill making several thousand
tons of steel each week just to let us experiment with the server configurations.
Thus, we made what improvements we could on the fly, and planned for a bigger
attack on the problem during the next scheduled maintenance outage. Using
device mirroring and unmirroring on the production server, we were able to shift
some of the database device files to the D:\ drive without interrupting
operations.
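In rough outline, the relocation trick looks something like this under SQL
Server 6.5 -- mirror the device onto a file on the new drive, then drop the
original side of the mirror. The logical device name and file paths below are
made up purely for illustration:

    -- Mirror the device onto a new file on the D:\ drive
    DISK MIRROR
        NAME = 'orderdata',
        MIRROR = 'D:\MSSQL\DATA\ORDERDATA.MIR'
    GO

    -- Once the mirror is established, retire the original C:\ copy;
    -- the D:\ copy carries on as the active device
    DISK UNMIRROR
        NAME = 'orderdata',
        SIDE = PRIMARY,
        MODE = REMOVE
    GO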
The production server had 96M of RAM, of which 80M was assigned to SQL Server.
By default, 30 percent of SQL Server's memory goes to the stored procedure
cache. Performance monitoring showed that at most about 15 percent of the
procedure cache was ever in use, so the configuration was changed accordingly,
freeing roughly 18M more RAM for the data cache. The mill setup server had 64M
of RAM, with only 16M assigned to SQL Server. PerfMon also showed that despite
the (usually generous) default 30 percent allocation to the stored procedure
cache, that cache was running 100 percent full. More memory was ordered for
that server, and completely new memory allocations were calculated for the new
installation there. Until the new memory arrived, SQL Server's share of RAM
was reconfigured to 40M, and the procedure cache was adjusted so that its
absolute size (not just its percentage) increased appropriately.
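For those keeping score, these changes amount to a handful of sp_configure
calls. The numbers below are only illustrative (SQL Server 6.5 takes the
memory setting in 2K units and the procedure cache setting as a percentage),
and the memory change doesn't take effect until the MSSQLServer service
restarts:

    -- Production server: keep SQL Server at 80M but shrink the
    -- procedure cache below its 30 percent default
    EXEC sp_configure 'memory', 40960          -- 80M in 2K units
    EXEC sp_configure 'procedure cache', 20    -- percentage; illustrative value
    RECONFIGURE
    GO

    -- Mill setup server: raise SQL Server's share from 16M to 40M
    EXEC sp_configure 'memory', 20480          -- 40M in 2K units
    RECONFIGURE
    GO

    -- DBCC PROCCACHE reports how much of the procedure cache is actually in use
    DBCC PROCCACHE
    GO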
At our first extensive (>24 hour) opportunity to work with the computers
off-line, the SQL Server database on the production server was blown away
(after transferring relevant data to a temporary area) and reinstalled from
scratch. We created data and log devices using sizes that were calculated
to easily hold more than one year's worth of production data -- namely,
75M for the data device and 35M for the log device. This was a far cry from
the more than 1700M to which the previous installation had expanded. RAM
configurations were redefined in a way we hoped would optimize SQL Server's
performance.
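Building devices and a database at those sizes comes down to something like
the following. The device names, paths, and virtual device numbers are
illustrative; DISK INIT takes sizes in 2K pages, while CREATE DATABASE takes
them in megabytes:

    -- 75M data device (75 x 512 = 38,400 2K pages)
    DISK INIT
        NAME = 'proddata',
        PHYSNAME = 'D:\MSSQL\DATA\PRODDATA.DAT',
        VDEVNO = 10,
        SIZE = 38400
    GO

    -- 35M log device (35 x 512 = 17,920 2K pages)
    DISK INIT
        NAME = 'prodlog',
        PHYSNAME = 'D:\MSSQL\DATA\PRODLOG.DAT',
        VDEVNO = 11,
        SIZE = 17920
    GO

    -- Put the data on its own device and the log on its own device
    CREATE DATABASE production
        ON proddata = 75
        LOG ON prodlog = 35
    GO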
The distribution database was recreated with a larger log (Microsoft
recommends a minimum of 15M, the original installation used 10M, and we now
run above the minimum). We used the database maintenance wizard to create
scheduled maintenance tasks -- something that hadn't been done before, and
that might not have been advisable anyway, given Microsoft's recommendation
against running such tasks on databases larger than 400M. The mill runs
full-time, so there's no advantage to scheduling tasks around midnight or on
Sundays. Instead, we staggered the background tasks across the day shift,
which lets us monitor performance and check for trouble more quickly and easily.
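The wizard schedules its work through SQL Executive, but the tasks themselves
amount to ordinary integrity checks, statistics updates, and dumps along these
lines. The database, table, and dump device names here are illustrative:

    USE production
    GO
    DBCC CHECKDB (production)          -- verify table and index structures
    DBCC NEWALLOC (production)         -- verify page allocation
    DBCC CHECKCATALOG (production)     -- verify system table consistency
    GO
    UPDATE STATISTICS orders           -- refresh optimizer statistics (per table)
    GO
    DUMP DATABASE production TO proddump WITH INIT   -- full dump to a dump device
    GO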
We immediately noticed better performance. Client queries that previously had
noticeable lag times now complete almost instantaneously. The replicated order
data arrives from the mainframe irregularly and in large chunks, so although
the overall load is small, things can get hectic at times. Mysterious
replication task failures have all but disappeared,
though log reader failures still happen occasionally on the mill setup server,
and whenever we think we have every data audit finally covered, someone
finds a way to get us with an "off by one" error. We are still
tweaking various configuration parameters and scheduled tasks, and we're
monitoring things constantly (at least on the day shift) in our search for
the perfect setup.
What I hope this shows is that with sufficient planning and work, installations
that don't follow perfect textbook examples can be made to function in the
real world. And conversely, that installations done with little or no planning
probably won't work. However, you shouldn't expect much help from conventional
sources (read: Microsoft support) when developing and troubleshooting such
installations, because you're likely to get the textbook answer that "you
shouldn't be doing it that way." For now, though, we've increased
the reliability of our applications, which pleases the operating folks greatly
and, apart from saving the corporation money, also pleases us personally,
since it directly translates into fewer of those unappreciated 2:00 a.m.
telephone calls. And while it might be a stretch to claim that improving
two-way replication's performance helps retain our employees, in these times
of low unemployment, anything that improves morale and decreases midnight
grumbling can't hurt.
David Peregrine Clements, a Maine
native, is now living in self-described "exile" in Mountain Brook,
AL (near Birmingham). He has a B.S. in physics and astronomy from Bowdoin
College in Maine and M.S. and Ph.D. degrees, also in physics and astronomy,
from the University of Delaware. He's currently a process control engineer
at USS and is responsible for maintaining computer models and databases
for the sheet products division of USS. He spent two years as a research
technician with the U.S. Antarctic Research Program -- one year at the South
Pole and one at McMurdo Station. www.concentric.net/~clements.