While troubleshooting push notification failures for a client, I found an interesting problem. The client had already configured the SRV record as required (http://blog.ucmadeeasy.com/) and disabled URL filtering as required (http://support.microsoft.com/kb/2664650), but push notifications were still failing with a 504 error. To take it one step further, we disabled all IM filtering completely, just in case. However, we still received a 504 error (server timeout) from the Push service.
As background, the Push Notification Clearing House (PNCH) runs in the Office 365 cloud and uses Lync Edge servers with dynamic federation. For more information on the three types of federation, refer to the article I wrote here: http://ocsguy.com/2011/04/20/a-few-words-on-federation/.
We were unable to troubleshoot the issue from the Office 365 side, so I decided to reconfigure my company's Edge server to use dynamic federation (it had been configured for direct federation) and see whether I could find any errors related to the customer's configuration.
I began by removing the customer's domain from the Federated Domains tab in the Lync Server Control Panel in my Lync environment. Next, I signed in to a test account in the customer's Lync environment (email@example.com) and attempted to IM an account in my environment (firstname.lastname@example.org). The IM failed immediately, so I began reviewing the UCCAPI log from my client and the SIP Stack logs from my Edge servers.
It didn’t take long to find the 504 error in the logs, including some useful diagnostic information:
In the "ms-diagnostics" line, we see "No match for domain in DNS SRV results", followed by the domain name (contoso.com) and the A record usim.us.contoso.com.
The problem lies in the A record that the federation SRV record points to: it doesn't match the SIP domain.
Now you may be thinking "they are both contoso.com", and if you are, you are not alone! The catch, however, is that the A record (usim.US.contoso.com) contains a subdomain that does not exist in the SIP domain, so the SRV record fails to match the SIP domain. Without a match, you would have to use Direct Federation instead of Enhanced or Dynamic Federation to federate with this organization. That would seem to be an easy fix, but since Office 365 supports only Dynamic Federation, the fix has to be a configuration change on the customer side.
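To make the matching rule concrete, here is a minimal Python sketch. It is not Lync's actual implementation; it assumes, consistent with the failure above, that the Edge server removes the leftmost label from the SRV target host and requires the remainder to equal the SIP domain exactly. A simple "ends with contoso.com" test would wrongly accept usim.us.contoso.com. The function names are hypothetical.

```python
def normalize(fqdn: str) -> str:
    """Lower-case a DNS name and drop any trailing root dot."""
    return fqdn.strip().rstrip(".").lower()


def srv_target_matches_sip_domain(srv_target: str, sip_domain: str) -> bool:
    """Return True if the SRV target's parent domain equals the SIP domain.

    Assumption (not confirmed Lync behavior): the leftmost label of the
    target host (e.g. "sip" in "sip.contoso.com") is stripped, and what
    remains must be exactly the SIP domain.
    """
    labels = normalize(srv_target).split(".")
    if len(labels) < 2:
        return False  # a single-label name has no parent domain to compare
    parent = ".".join(labels[1:])
    return parent == normalize(sip_domain)


if __name__ == "__main__":
    for target in ("usim.us.contoso.com.", "sip.contoso.com."):
        print(target, srv_target_matches_sip_domain(target, "contoso.com"))
    # usim.us.contoso.com. False
    # sip.contoso.com. True
```

Under this assumption, usim.us.contoso.com would only match a SIP domain of us.contoso.com, which is why the record fails for contoso.com.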
To resolve the issue, we created a new DNS A record to use for federation (sip.contoso.com) and updated the Access Edge certificate to include this name in its SAN (Subject Alternative Name) field. Once these steps were completed, push notifications began working on the mobile clients.
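Because the fix has two parts, it can help to check the certificate side before testing again. The Python sketch below is a hypothetical pre-flight helper, not a Lync tool: it reports which required federation names are missing from a certificate's SAN list, using exact, case-insensitive comparison only (wildcard SAN entries are deliberately not treated as matches). The "before" SAN list is an assumed example, not the customer's real certificate.

```python
def normalize(name: str) -> str:
    """Lower-case a DNS name and drop any trailing root dot."""
    return name.strip().rstrip(".").lower()


def missing_from_san(required_names, san_entries):
    """Return the required names that do not appear in the SAN list.

    Exact, case-insensitive comparison only; wildcards are not expanded.
    """
    san = {normalize(entry) for entry in san_entries}
    return [name for name in required_names if normalize(name) not in san]


if __name__ == "__main__":
    # Hypothetical SAN lists on the Access Edge certificate.
    before = ["usim.us.contoso.com"]
    after = ["usim.us.contoso.com", "sip.contoso.com"]
    print(missing_from_san(["sip.contoso.com"], before))  # ['sip.contoso.com']
    print(missing_from_san(["sip.contoso.com"], after))   # []
```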
Lesson learned: as a best practice, make sure your DNS A records for federation don't include a subdomain unless your SIP domain does as well.