I have been seeing this error off and on for the past month to month and a half. It came from only 7 servers, and at random times. Collecting the errors and investigating what was happening at those times is another reason it took so long to determine the cause.
DESCRIPTION: Server TCP provider has stopped listening on port [ 1433 ] due to a failure. Error: 0x2747, state: 2. The server will automatically attempt to reestablish listening.
When researching the error I found several forum and blog posts related to the error message. One of them indicates the server may be underpowered for the workload, which saturates the system's buffer space. To read more follow this link.
Here is an excerpt from the thread.
“The errors above are pretty critical. 0x2747 (10055) maps to WSAENOBUFS (An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.). In general this indicates issues with the paged or non-paged pool in the kernel. The simultaneous IO hick-up is something very bad, too – corruption or data loss may occur if SQL is not able to write to the IO”
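As a quick check of the mapping quoted above, the hex code from the log can be converted to decimal and looked up by name. This is only a minimal sketch: the function name decode_sql_error and the one-entry lookup table are hypothetical, and the table holds nothing beyond the single code named in the excerpt.

```python
from typing import Tuple

# Hypothetical lookup table; it holds only the code named in the excerpt.
WINSOCK_ERRORS = {
    10055: "WSAENOBUFS",  # no buffer space available, or a queue was full
}


def decode_sql_error(hex_code: str) -> Tuple[int, str]:
    """Convert a hex error code from the SQL Server log to decimal and name it."""
    value = int(hex_code, 16)  # base 16 accepts the leading "0x"
    return value, WINSOCK_ERRORS.get(value, "unknown")


print(decode_sql_error("0x2747"))  # (10055, 'WSAENOBUFS')
```

Any other code passed in comes back as "unknown", which is a reminder to look it up rather than guess.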
Another link below gives the same information as above: the system buffers are being saturated or the server is underpowered.
To get more information and determine whether the servers were underpowered, I created a server-side trace using the Errors and Warnings event category. This category also includes the CPU Threshold Exceeded event. Keep in mind that if you use the same category you may capture a lot of information. I suggest you closely monitor the amount of data and the free space on the drive where you save the traces, so you don’t fill the drive and cause further issues.
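One way to keep an eye on the trace drive is a small script run on a schedule. The sketch below is not what I ran. It assumes Python is available on the server, the function names and the 20% free-space threshold are made up for illustration, and the path you pass in would be wherever your traces are written.

```python
import shutil


def has_room(total_bytes, free_bytes, min_free_fraction=0.20):
    """Return True if the free share of the drive is at or above the threshold."""
    return free_bytes / total_bytes >= min_free_fraction


def trace_drive_has_room(path, min_free_fraction=0.20):
    """Check the drive holding `path` (assumed to be the trace folder)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return has_room(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # "." is a placeholder; point this at the folder where the traces are saved.
    if not trace_drive_has_room("."):
        print("Trace drive is low on space; consider stopping the trace.")
```

The threshold check is kept separate from the disk lookup so it can be tried with plain numbers before pointing it at a real drive.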
Other Things to Consider
While I was investigating the above error, an email thread was started within the System Admin group, and at some point we (the DBAs) were included. The discussion was about the product we use enterprise-wide for Host Intrusion / Firewall / Virus Protection. The thread mentioned trouble with automatic updating of that software and issues the SAs were seeing on application servers after the updates. Their solution was to disable parts of the software.
I then reached out to one of the SAs to get more details. After obtaining the details I logged on to the servers in question and disabled the HIPS service. I documented what was done for future issues and to cover my “you know what”. After a week we have not seen any of the above errors. I will continue to monitor and gather data to make sure we are not having the other issues mentioned but so far I have not seen any.
Keep in mind that researching an error and finding a possible fix may not give you the solution for your environment. The error mentioned usually indicates kernel issues associated with heavy workloads. The information I was able to gather did not indicate that at all.
Troubleshooting errors is a time-consuming process. It took over a month to work this issue. I will continue to gather information and monitor the servers in question, but for now things appear to be solved.