Troubleshooting Replication

Replication is a complex process, and although Microsoft® SQL Server™ is designed to detect and correct problems automatically, you may encounter problems when implementing your application. Most replication problems can be resolved by following a general troubleshooting approach. Other problems require a specific set of instructions.

General Approach to Replication Troubleshooting

You can begin troubleshooting replication problems by viewing the task history to determine which task failed and the reason for failure. Message details cannot always identify the problem, but can often provide an indicator of the issues, for example, connectivity problems, permissions restrictions, log full errors, and so on.

You can also begin troubleshooting using Replication Monitor to view the status of replication agents:

In Replication Monitor, right-click the distribution agent serving the subscriber, then click Agent Properties. This allows you to view the job properties.
Click the Steps tab, then double-click Run Replication Agent.
Cut the string from the command window.
Paste the string into a console window, with Distrib.exe at the beginning, and with an extra parameter -Output.
Review the output for indications of the problem.

You can increase the amount of history logged by a replication agent and use the additional information to get a more detailed understanding of actions and failures at each point in the replication process. Set the HistoryVerboseLevel to its highest value in the replication agent profile.

If you suspect there is a problem with data consistency in your application, use sp_table_validation to test for row count or checksum differences.

If you encounter one of the following problems, follow the procedures described. For current information about SQL Server replication and suggested solutions to problems, see www.microsoft.com/sql and www.microsoft.com/support.

Cannot Start a Replication Agent

A replication agent may not start because SQLServerAgent always calls the xp_logininfo stored procedure to validate that you still belong to your Microsoft Windows NT® user groups and to verify your login permissions to the server. The called stored procedure always makes a round trip to the domain controller to do this work. If the agent will not start, an error is returned. This error shows only in the Jobs folder. It never gets propagated to the monitoring node because the agents never actually run, so no notification can be returned. The workaround for the agent not starting is to use standard security or a local computer login as the owner for your jobs.

Cannot Start Another Replication Job

Some replication agents allow only one instance of a particular job to run at a time, for example, one log reader per publication database, one Distributor or Merge Agent per publication/Subscriber pair, one snapshot per publication, and so on. If these jobs fail due to connection failure, it is possible you will not be able to start another job until the network connection timeout is reached or you kill the system process ID (SPID) of the failed job.

Cannot Find Conflicts

If you receive a message that conflicts occurred during the merge process, you can use Replication Conflict Viewer to review the outcomes of the conflicts and to make changes to these outcomes. Make sure you connect to the correct server to view the conflicts. The location of the conflict table varies depending upon whether replication has been configured for centralized or decentralized logging of conflicts. Conflict reporting is usually centralized (the default). If centralized, the conflict table is stored at the Publisher and you must connect to the Publisher to view the conflicts. If decentralized, the conflict table is stored at either the Publisher or the Subscriber, depending upon who lost the conflicts.

Access Denied Reading or Writing Snapshot Files

If you cannot start Snapshot Agent, and receive an “Access Denied” message, run Dcomcnfg.exe. Click the Default Security tab, then make sure the Windows NT account that SQL Server Agent runs under is enabled to have default access and launch permissions.

If the replication agents cannot access the snapshot folder on the Distributor, make sure that the folder is shared correctly. On a Distributor server running Windows NT, the snapshot folder defaults to using the <drive>$ share and a path of \\<computer>\<drive>$\Mssql7\Repldata.

On a Distributor server running Microsoft Windows® 95/98, the snapshot folder defaults to using the <drive> without a share and a path of <drive>:\Mssql7\Repldata. If your application requires the ability to create pull subscriptions on a server running Windows 95/98, you must change the snapshot folder to a network path accessible by replication agents running at the Publisher and Subscribers. You can change the local path to a network path by sharing the folder manually.

Replicating from SQL Server 7.0 to SQL Server 6.5

If you are attempting to replicate from SQL Server 7.0 to SQL Server 6.5 and receive an error message that MSreplication_ subscripions is an invalid object error, try the following procedures:

Run Replp70.sql at the SQL Server 6.5 Subscriber.
Run sp_addpublisher at the Subscriber. Make sure you have the syntax correct:

sp_addpublisher 70 <publisher server name>, <NT account used by

Dist. agent>

Register the SQL Server 6.5 Subscriber at the SQL Server 7.0 Enterprise Manager.
Create a publication at the SQL Server 7.0 Publisher and perform a push subscription to the SQL Server 6.5 Publisher.

For more information about the replicating between different versions of SQL Server, see Replicating Between Different Versions of SQL Server.

Merge Agent or Distribution Agent Fails on Time-out

If the Merge Agent or Distribution Agent fails because of a time-out, increase the QueryTimeout value in the Merge Agent or Distribution Agent profile.

Merge Agent Fails While Enumerating Deletions at the Subscriber

If the Merge Agent fails while enumerating deletions at the Subscriber, examine the error details. This condition is often the result of a large number of deletes to process and a small QueryTimeout value. Increase the QueryTimeout value in the Merge Agent profile.

Errors Occur Applying Constraints During the Initialization of a Partitioned, Merge Publication at a Subscriber.

If errors occur when the snapshot of a partitioned merge publication is applied at a Subscriber, the publication filter may not be defining all of the data needed at the Subscriber to support the constraints referenced at the Subscriber. For example, if you have an EMPLOYEES table that contains a self-referencing constraint on the EMPLOYEES.SUPERVISOR_ID column, make sure your partitioned data set includes all of the supervisors for the employees in the partition as well.

Conflicts Occur When Merging Newly Inserted Rows That Contain Identity Columns

If conflicts occur when merging newly inserted rows that contain identity columns, use the Replication Conflict Viewer to determine the cause of the conflict. An insert conflict is usually caused by inconsistent enforcement of constraints between Publishers and Subscribers. Also, identity columns must be used with caution. You must assign each Subscriber that will insert new rows containing an identity a unique range of identity values.

Data Validation Appears to Fail

Data validation uses rowcounts and checksums to determine if data at the Subscriber has diverged from data at the Publisher. However, there are several conditions other than actual data divergence that could cause data validation to fail.

Checksum computes a 32-bit cyclic redundancy check (CRC) on the entire row image on the page. It does not check columns selectively and cannot operate on a view unless it includes all columns of only a single table (a horizontal partition is fine) or vertical partition of the table (in the case of transactional - merge doesn't support vertical partitions period). Also, the checksum skips the contents of text and image columns (by design due to performance considerations).
When doing a checksum, the structure of the table must be identical between the two servers; that is, the tables must have the same columns created and existing in the same order, same data types and lengths, and same NULL/NOT NULL conditions. For example, if the Publisher did a CREATE TABLE, then an ALTER TABLE to add columns, but the script applied at the Publisher is a simple CREATE table, the structure is NOT the same. If you are not certain that the structure of the two tables is identical, look at syscolumns and confirm that the offset for every column in each table is consistent. (Rowcounts would be fine.)
Floating point values are likely to generate checksum differences if character-mode bcp was used, which is the case if the publication has heterogeneous subscribers. These are due to minor and unavoidable differences in precision when doing conversion to and from character mode. Therefore, tables with float or real datatypes which were set up to allow non-sql server subscribers would be likely to generate checksum differences. (Rowcounts would be fine.)
A custom resolver might be written that does not ensure data convergence, or even expressly does not attempt to attain convergence. For example, a custom resolver may by design only resolve downloads to the Subscriber and not uploads to the Publisher.

If you suspect something is wrong, first check conflict tables, error tables for and possible explanations. For example, if the Subscriber database or log is full, SQL Server may be unable to download all of the changes. SQL Server would log an error in the conflict table at the Publisher indicating that not all updates were processed at the Subscriber.

It may be helpful to do a distributed query between the two machines that could pinpoint where the differences lie. Below is an example of a distributed query that would find rows in a table that exist on one side, but not the other, as well as rows that have the same primary key at each side, but which have a different checksum value. Having indentified the problem rows, you could then look at the specific values to attempt to understand if a real problem might exist, and why.

Note You cannot use OPENROWSET on a server running Windows NT Authentication only.

SELECT authors.au_id AS Local_PK, remote.au_id AS Remote_PK,

(getchecksum(authors.au_id,1),0) AS Local_chksum,remchksum

FROM authors

FULL OUTER JOIN

OPENROWSET('SQLOLEDB','SUBSRV';'sa';'sa_password',"SELECT au_id,(getchecksum(NULL,1),0) AS remchksum FROM pubs.dbo.authors" ) AS remote

ON (authors.au_id=remote.au_id)

WHERE

-- Find rows with same primary key but different checksums

(getchecksum(authors.au_id,1),0) <> remchksum

OR

-- Find rows which do not exist on one side or the other

authors.au_id IS NULL OR remote.au_id IS NULL

If the query returns rowcount differences, comment out the checksums. The checksums are unnecessary at that point and are more resource intensive than rowcounts to execute.

For more information on data validation, see Monitoring Data Validity and sp_table_validation.

Overview of Replication	How to monitor replication agent history (Enterprise Manager)
Monitoring Replication	How to monitor replication agent session details (Enterprise Manager)
How to monitor replication alerts (Enterprise Manager)	Replication Conflict Viewer
sp_table_validation	xp_logininfo
IDENTITY (Property)	Use the NOT FOR REPLICATION Option on the IDENTITY Property
CHECK Constraints