Replication is a complex process, and although Microsoft® SQL Server™ is designed to detect and correct problems automatically, you may encounter problems when implementing your application. Most replication problems can be resolved by following a general troubleshooting approach. Other problems require a specific set of instructions.
You can begin troubleshooting replication problems by viewing the task history to determine which task failed and the reason for failure. Message details cannot always identify the problem, but can often provide an indicator of the issues, for example, connectivity problems, permissions restrictions, log full errors, and so on.
You can also begin troubleshooting using Replication Monitor to view the status of replication agents:
You can increase the amount of history logged by a replication agent and use the additional information to get a more detailed understanding of actions and failures at each point in the replication process. Set the HistoryVerboseLevel to its highest value in the replication agent profile.
If you suspect there is a problem with data consistency in your application, use sp_table_validation to test for row count or checksum differences.
If you encounter one of the following problems, follow the procedures described. For current information about SQL Server replication and suggested solutions to problems, see www.microsoft.com/sql and www.microsoft.com/support.
A replication agent may not start because SQLServerAgent always calls the xp_logininfo stored procedure to validate that you still belong to your Microsoft Windows NT® user groups and to verify your login permissions to the server. The called stored procedure always makes a round trip to the domain controller to do this work. If the agent will not start, an error is returned. This error shows only in the Jobs folder. It never gets propagated to the monitoring node because the agents never actually run, so no notification can be returned. The workaround for the agent not starting is to use standard security or a local computer login as the owner for your jobs.
Some replication agents allow only one instance of a particular job to run at a time, for example, one log reader per publication database, one Distributor or Merge Agent per publication/Subscriber pair, one snapshot per publication, and so on. If these jobs fail due to connection failure, it is possible you will not be able to start another job until the network connection timeout is reached or you kill the system process ID (SPID) of the failed job.
If you receive a message that conflicts occurred during the merge process, you can use Replication Conflict Viewer to review the outcomes of the conflicts and to make changes to these outcomes. Make sure you connect to the correct server to view the conflicts. The location of the conflict table varies depending upon whether replication has been configured for centralized or decentralized logging of conflicts. Conflict reporting is usually centralized (the default). If centralized, the conflict table is stored at the Publisher and you must connect to the Publisher to view the conflicts. If decentralized, the conflict table is stored at either the Publisher or the Subscriber, depending upon who lost the conflicts.
If you cannot start Snapshot Agent, and receive an “Access Denied” message, run Dcomcnfg.exe. Click the Default Security tab, then make sure the Windows NT account that SQL Server Agent runs under is enabled to have default access and launch permissions.
If the replication agents cannot access the snapshot folder on the Distributor, make sure that the folder is shared correctly. On a Distributor server running Windows NT, the snapshot folder defaults to using the <drive>$ share and a path of \\<computer>\<drive>$\Mssql7\Repldata.
On a Distributor server running Microsoft Windows® 95/98, the snapshot folder defaults to using the <drive> without a share and a path of <drive>:\Mssql7\Repldata. If your application requires the ability to create pull subscriptions on a server running Windows 95/98, you must change the snapshot folder to a network path accessible by replication agents running at the Publisher and Subscribers. You can change the local path to a network path by sharing the folder manually.
If you are attempting to replicate from SQL Server 7.0 to SQL Server 6.5 and receive an error message that MSreplication_ subscripions is an invalid object error, try the following procedures:
sp_addpublisher 70 <publisher server name>, <NT account used by
Dist. agent>
For more information about the replicating between different versions of SQL Server, see Replicating Between Different Versions of SQL Server.
If the Merge Agent or Distribution Agent fails because of a time-out, increase the QueryTimeout value in the Merge Agent or Distribution Agent profile.
If the Merge Agent fails while enumerating deletions at the Subscriber, examine the error details. This condition is often the result of a large number of deletes to process and a small QueryTimeout value. Increase the QueryTimeout value in the Merge Agent profile.
If errors occur when the snapshot of a partitioned merge publication is applied at a Subscriber, the publication filter may not be defining all of the data needed at the Subscriber to support the constraints referenced at the Subscriber. For example, if you have an EMPLOYEES table that contains a self-referencing constraint on the EMPLOYEES.SUPERVISOR_ID column, make sure your partitioned data set includes all of the supervisors for the employees in the partition as well.
If conflicts occur when merging newly inserted rows that contain identity columns, use the Replication Conflict Viewer to determine the cause of the conflict. An insert conflict is usually caused by inconsistent enforcement of constraints between Publishers and Subscribers. Also, identity columns must be used with caution. You must assign each Subscriber that will insert new rows containing an identity a unique range of identity values.
Data validation uses rowcounts and checksums to determine if data at the Subscriber has diverged from data at the Publisher. However, there are several conditions other than actual data divergence that could cause data validation to fail.
If you suspect something is wrong, first check conflict tables, error tables for and possible explanations. For example, if the Subscriber database or log is full, SQL Server may be unable to download all of the changes. SQL Server would log an error in the conflict table at the Publisher indicating that not all updates were processed at the Subscriber.
It may be helpful to do a distributed query between the two machines that could pinpoint where the differences lie. Below is an example of a distributed query that would find rows in a table that exist on one side, but not the other, as well as rows that have the same primary key at each side, but which have a different checksum value. Having indentified the problem rows, you could then look at the specific values to attempt to understand if a real problem might exist, and why.
Note You cannot use OPENROWSET on a server running Windows NT Authentication only.
SELECT authors.au_id AS Local_PK, remote.au_id AS Remote_PK,
(getchecksum(authors.au_id,1),0) AS Local_chksum,remchksum
FROM authors
FULL OUTER JOIN
OPENROWSET('SQLOLEDB','SUBSRV';'sa';'sa_password',"SELECT au_id,(getchecksum(NULL,1),0) AS remchksum FROM pubs.dbo.authors" ) AS remote
ON (authors.au_id=remote.au_id)
WHERE
-- Find rows with same primary key but different checksums
(getchecksum(authors.au_id,1),0) <> remchksum
OR
-- Find rows which do not exist on one side or the other
authors.au_id IS NULL OR remote.au_id IS NULL
If the query returns rowcount differences, comment out the checksums. The checksums are unnecessary at that point and are more resource intensive than rowcounts to execute.
For more information on data validation, see Monitoring Data Validity and sp_table_validation.