maitakeboy
I am stumped by an issue that cropped up a couple days ago. My network has an HQ with VPN links to 3 regional offices. On Tuesday morning, out of the blue, people in HQ began having difficulty with their mapped drives to one, and subsequently two, of the regional offices.
We had a planned power outage in the first affected site the night before, so the server was shut down cleanly then restarted in the morning normally, with no problems. Then the calls started coming in.
I shut down the entire network infrastructure in the regional office and brought it up in proper order, thinking it was just a routing table or other anomaly due to the order in which the routers, firewalls and switches came up. After bringing everything up in correct order, retested, same behavior.
The symptoms were that users could map a drive. When they clicked on the mapped drive, all the folders would enumerate, but then the window would blink, appear empty, then repopulate. Oftentimes the user could then go in and browse around the folders and subfolders, maybe even open or copy a file. But within a minute, they would get an error saying that the network connection was lost, or that "An error occurred reconnecting to X:, the local device name is already in use." Wait a second, and you can get back in, but only for less than a minute.
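To put timestamps on the drop-and-recover cycle described above, so it can be lined up against VPN and firewall logs, something like the following could be run from an HQ workstation. This is only a rough sketch: the function names `probe_share` and `watch_share` are made up for illustration, and it assumes that a failed directory listing on the share surfaces as an `OSError` in Python.

```python
import os
import time
from datetime import datetime


def probe_share(path, lister=os.listdir):
    """Try one directory listing; return (ok, detail)."""
    try:
        entries = lister(path)
        return True, f"{len(entries)} entries"
    except OSError as exc:
        # Assumption: a dropped share shows up as an OSError here.
        return False, f"{type(exc).__name__}: {exc}"


def watch_share(path, attempts=30, interval=5.0,
                lister=os.listdir, sleep=time.sleep):
    """Probe the share repeatedly; return (timestamp, ok, detail) rows."""
    rows = []
    for i in range(attempts):
        ok, detail = probe_share(path, lister)
        stamp = datetime.now().isoformat(timespec="seconds")
        rows.append((stamp, ok, detail))
        if i < attempts - 1:
            sleep(interval)  # wait before the next probe
    return rows
```

Calling `watch_share(r"\\regional-fs01\shared")` (a hypothetical UNC path) and printing the rows would show when the listing fails and when it recovers.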
This was only for users in HQ going to two of the regional office shares. Users in the regional offices going to shares in the HQ had no problems. Users on the local LAN in the regional offices experienced no problems. However, a user in one regional office reported the behavior when he tried to connect to a share in one of the other two affected regional offices.
So these are the steps we took:
- We updated the NIC drivers. (The machines that are experiencing this are Proliant DL 390 gen 8 and gen 9. The third regional office, which is not experiencing this issue, has a Server 2012 R2 server, but it is a VM running on a QNAP NAS.)
- Rebooted and checked the VPNs for errors, retransmits, dropped packets, etc. Both tunnels were clean.
- Looked at the internet connections at all the sites. Everything was clean except for a lot of dropped packets on one of the HQ feeds. Opened a support case with the provider. Still waiting on a resolution there.
- Checked the switches in the regional offices. No obvious problems.
- Changed the switch port, the server network port and the network cable at one site.
- Checked the time on each of the machines. Everything was in sync.
None of these made any change in the behavior. However, when we mapped a drive to one of the affected regional offices using IP instead of computer name, we were able to get a stable connection. This did not work in the other affected regional office, however.
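Since mapping by IP behaved differently from mapping by name at one site, one thing worth capturing is what the server name actually resolves to from an HQ client, and whether TCP port 445 (SMB) answers on both the name and the IP. The following is a minimal sketch under stated assumptions: the helper names are invented for illustration, it only looks at IPv4 results from the operating system's resolver, and it does not exercise SMB itself (authentication, sessions, SPNs), only name resolution and a plain TCP connect.

```python
import socket

SMB_PORT = 445  # TCP port used by SMB file sharing


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    try:
        infos = resolver(hostname, SMB_PORT, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # name did not resolve at all
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def name_matches_ip(hostname, expected_ip, resolver=socket.getaddrinfo):
    """Return (matches, addresses): does the name resolve only to expected_ip?"""
    addresses = resolve_ipv4(hostname, resolver)
    return addresses == [expected_ip], addresses


def can_connect(host, port=SMB_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `name_matches_ip("regional-fs01", "10.20.0.10")` (both values hypothetical) returning `False` with an extra or stale address in the list would point toward name resolution rather than the tunnel; `can_connect` can then be tried against the name and the IP separately.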
So you see my dilemma. Nothing really adds up. The symptoms are all anomalous. If it was the VPN, we should be seeing the problem in both directions. We are not. If it was simply a local issue in the regional offices, we would expect to see similar behavior on the LAN. We are not. If we implement a fix that gives a partial solution at one site, we would expect it to work at the other site too. It does not.
I am reduced to considering it the work of Russians, or cosmic rays, or ghosts. I am stumped.
Any help would be greatly appreciated.