Category Archives: Network Troubleshooting

Packet Capture on Both Sides of a Conversation

Greetings everyone!! Yes, it’s been a while…sorry about that. Been busy with life…I’ll just leave it at that.

So at work, for the last couple of days, the Citrix admin has been having an issue with users at one of our larger remote sites…seems they are intermittently unable to connect to a Citrix server after being redirected by the Citrix license server. He brought me in this morning once he realized the issue was only occurring at this one site. Very interesting!

First step was to capture some of the traffic and see what's going on. I have a Linux server at the Data Center running Snort, watching most of the traffic into and out of the Data Center, so it comes in very handy when I need to capture some traffic. I started tcpdump on the server, specifying the interface that carries the traffic from the remote site, and wrote the output to a file…the command was…

tcpdump -vv -nn -s 0 -i eth3 -w /root/pcapfiles/citrix_issue.pcap host 10.10.21.223
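
(Side note…if you want a quick sanity check before moving the file anywhere, tcpdump can read the capture right back on the same box with standard options…the path below just matches the capture command above…)

tcpdump -nn -r /root/pcapfiles/citrix_issue.pcap | head -20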

The Citrix admin was remoted onto a PC at the site, and he attempted to connect into Citrix. After I captured the data, I moved the file to my laptop and opened it up with Wireshark…I saw the 3-way handshake (SYN – SYN/ACK – ACK), and then some data going back and forth, but the session never started up and it timed out. Weird.
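
(If you're working through a busy capture like this, a simple Wireshark display filter keyed on the PC's address…for example the one below…is a quick way to isolate just that conversation, and right-clicking a packet and choosing Follow TCP Stream narrows it down to a single session.)

ip.addr == 10.10.21.223 && tcp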

I also captured a Wireshark session from my own laptop so I could see how it was supposed to work. I still could not see what the issue was…very strange.

I then decided that I needed to see packet captures from both sides of the conversation. So I remoted into the PC and installed Wireshark on it…I then started up Wireshark on the PC and TCPDUMP on my Linux server, and then tried the Citrix client again. After it timed out, I opened up both PCAP files in Wireshark and examined them side by side, packet by packet. On the PC, I saw this…(PC = 10.10.21.223, and Citrix server = 10.12.1.122)…

Packet capture on PC side showing initial SYN packet

I found the matching packet on the capture from the Data Center…

Packet capture from Citrix server side showing full handshake

Say what????? I see a completed 3-way handshake…including the SYN/ACK from the server back to the PC (which was not in the PC capture), and another packet from the PC to the server completing the handshake…also not in the PC capture. VERY bizarre!!!

Then it hit me…Riverbed!! The only way a handshake could be completed in this manner was from another device sitting in-between…and we do have that. We have Riverbed devices sitting at the Data Center and at most large remote sites handling WAN optimization. Since no other sites were reporting any issues, the Riverbed at this remote site had to be the one that was “kinked”. So we restarted the Citrix application optimization process on the Riverbed, and that fixed everything!
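
(As a side note…if you ever suspect a WAN optimizer or other middlebox is answering the handshake on the server's behalf, one general trick…not something I needed once the Riverbed clicked…is to pull the SYN and SYN/ACK packets out of each capture and compare fields like the IP TTL. A SYN/ACK generated locally by a middlebox will usually arrive with a different TTL than one that actually crossed the WAN from the real server. Something along these lines with tshark works, where the file name is just a placeholder for your Data Center capture…)

tshark -r dc_side.pcap -Y "ip.addr == 10.10.21.223 && tcp.flags.syn == 1" -T fields -e frame.number -e ip.src -e ip.dst -e ip.ttl -e tcp.flags.ack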

VERY cool…and very interesting. This took some time to figure out, but once I got visibility at both ends of the conversation, the answer was easy. Remember…Wireshark is your friend.

Troubleshooting – Update on My T1 Circuit Issue

As a follow-up to my post last week concerning troubleshooting a problem T1 circuit, it looks like we are finally making some progress. After working the issue yesterday, the carrier dispatched a cable specialist to narrow down exactly what the problem is and where it lies along the cable span. The carrier sent me an update stating that the LEC (Local Exchange Carrier) has “dispatched a cable specialist who has determined there is an unbalanced signal between the last repeater in the local loop distribution plant and the customer premises”.

An “unbalanced signal”…well, that's a new one for me. But hey, as long as it gets fixed, I'm fine with that.

When I first arrived at work this morning, I checked my stats on the router and verified the circuit was still taking heavy errors…

T1 circuit still taking heavy errors
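
(If you want to pull the same kind of error stats on your own router, the controller command covered in the post further down this page is the place to look…adjust the slot/port numbering for your interface…)

show controller t1 0/0/0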

Later in the day I received an update from the carrier stating that a repair had been made, and that they showed the circuit running clean for the last two hours. Hmmm…really…let me check…

Circuit running clean for the last two hours

Well, well…the circuit really is running clean for the last two hours (8 intervals). NICE!!

One more thing to do…clear the counters on the serial interface. Take a look at the stats on the serial interface…LOTS of accumulated errors over the last 2+ years…so let's clear all those stats and keep track of the circuit from this point on…

Show interface information, and clearing the stats
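
(For reference, the commands in that screenshot are along the lines of the two below…the interface numbering here is only an example, so use your own…)

show interface serial 0/0/0
clear counters serial 0/0/0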

I’ll be keeping track of this T1 over the next several days to see if the repair made by the carrier really did fix the issue. Fingers are crossed!!

Troubleshooting – T1 Circuit Errors and Controller Stats

For the last several weeks, I’ve been having a T1 circuit issue at one of my remote sites. The carrier has been working the problem, but the issue is intermittent and difficult to narrow down. This site is way out in the boonies, and I think some of the cable span is old and some moisture has leaked into the cable. So, what can you do to see the health of a T1 circuit? Take a look at the controller stats using the command…

show controller t1 0/0/0         (use the appropriate card slot numbering for your interface)
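
(Side note…on most IOS versions you can add an output filter to jump straight to the 24-hour summary at the bottom of that output, for example…)

show controller t1 0/0/0 | begin Total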

Each Cisco router keeps a log of the errors on a T1 circuit for the past 24 hours, in 15-minute blocks…so 96 “intervals” as we say. Take a look at this snippet of a clean running T1 circuit…

Example of a clean running T1 circuit

The first data interval is for the current 15-minute block, and shows the elapsed time…in this case 351 seconds. After that, each interval is a full 15 minutes, and this sample shows a very clean running T1 circuit. Notice the last block of data shows the summary of all errors for the preceding 24 hours (96 intervals). I sure wish all my T1s ran this clean.

Now, here is a snippet from my problem T1 taken earlier today…

Controller stats of a T1 circuit having physical layer issues

A bit messy, wouldn't you say? The first 3 intervals show a circuit up and running, but VERY poorly…few, if any, applications would work properly over this type of circuit (and they weren't working properly, which my end customer could vouch for). Take a look at interval 17…there are 900 unavailable seconds, which is how many seconds there are in 15 minutes. So for this interval, the circuit was completely down. And notice the Total Data for all intervals…this circuit is indeed in very poor health.

What does this information tell you? Basically, with this kind of high error rate, the problem is almost always with the carrier (issues with the cable span, NIU, or Central Office equipment). In all my years of troubleshooting T1 circuits, I've only had a few times where the issue was on my side (usually cabling issues with my extended demarc). And remember, you can copy this information and send it to the carrier to help prove your case.
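
(One easy way to grab that proof…if your IOS supports the usual output redirection, you can send the controller stats straight to a text file and attach it to the trouble ticket. The TFTP server address below is just a placeholder…)

show controller t1 0/0/0 | redirect tftp://192.168.1.50/t1-stats.txt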

Hope this helps!

How to Stress Test a T1 Circuit

Most networks have T1 circuits, the most common being either an MPLS T1 or an Internet T1. There will be times when one of your T1 circuits acts up in a sporadic manner, causing “slowness” for your end users, and requires you to be more proactive in troubleshooting the root cause. This post will talk about how to stress test a T1 using PING.

First off, understand that using a PING command with the default parameters will tell you if the circuit is up or down, and it may show problems such as large latency or excessive drops. But to really test a T1, you need to modify the use of PING to perform a more thorough test. Common actions are to increase the packet size and frequency of pings to better test throughput, and to use specific data patterns to better test the operation of the T1.

You can use Cisco's PING, which is part of IOS. Here is an example of an extended PING where we increase the packet size to the max MTU of 1500 bytes and run all 1's (which will provide additional stress on the circuit)…

Using Cisco's IOS ping command
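
(If you'd rather skip the interactive dialog, most IOS versions will take the same parameters on one line, something like the example below…the address is just a placeholder for the far end of the circuit…)

ping 192.168.1.1 size 1500 repeat 1000 data FFFF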

A much better PING to use, though, is the Linux ping with the “flood” option, as this will allow you to really hammer the T1 circuit. (Note…you need to be root to use the flood option.) The difference is this…Cisco's PING will send an echo-request, but will wait for the echo-reply before it can send another echo-request. This greatly reduces the amount of ping traffic IOS can send across the T1. Linux, however, will send echo-requests as fast as the replies come back, at a minimum of 100 per second. For each echo-request packet it sends, it prints a “.” (dot) on the screen. For each echo-reply it receives, it prints a backspace. So if you only see a couple of dots, then the circuit is handling your ping flood easily. However, if you start seeing dots race across the screen, then there are problems. Here is a Linux PING flood example with 1500-byte packets and running all 1's…

Using Linux ping -f (flood) option to stress a T1 circuit
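
(For reference, the flood command will look something like the line below…the exact size and target in your case may differ. The -s 1472 keeps the IP packet at the 1500-byte MTU once the 8-byte ICMP header and 20-byte IP header are added, and -p ff pads the payload with all 1's…the address is just a placeholder for the far end of the T1. And if you want to fully saturate the circuit as mentioned below, just run the same command in a second terminal at the same time.)

sudo ping -f -s 1472 -p ff 192.168.1.1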

As you can see, there are only three dots…7356 packets were sent and 7353 were received. That leaves 3 missing packets. This T1 easily handled the test. Plus, a Linux ping flood will typically load up a T1 in the range of 700-900 Kbps (about half of a T1 circuit's 1.544 Mbps). If you really want to fully load up a T1, run two instances of ping flood at the same time, and you will see the T1 circuit fully saturated (or nearly so). Of course, do NOT do this during normal business operations…you will heavily impact the end users, and they will not be happy. When running the Linux ping flood shown above, the resulting bandwidth impact on the T1 was…

Using “show interface” to see bandwidth impact of ping flood

In my next post I will give an example of how I used ping flood to troubleshoot a T1 circuit whose performance was impacted by a unique problem.