Monday, December 14, 2009

Syslog and UDP issues


I'm going to use this space to outline an issue I'm currently experiencing, and hopefully to post a resolution once I find one.


I recently helped set up a couple of syslog servers to consolidate a lot of data for a client. We ran into a problem where data was not being logged even though logs were being sent to the system. We used tcpdump to confirm that the logs were reaching the server and were correctly formatted, but syslog was not writing them to disk. However, other logs were being written.

The problem seems to relate to the UDP recv-q. Here's what we were seeing:
[root@test-lab ~]# netstat -anu | grep 514
udp   110160      0 0.0.0.0:514                 0.0.0.0:*
That 110160, as best I can tell, is the number of bytes sitting in the socket's receive buffer that syslog has not yet picked up and processed. Watching that number over the course of a few days, it would occasionally rise or fall ever so slightly, but I was unable to determine what it was actually doing.
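To keep an eye on that value without reading the full netstat output each time, here's a quick sketch. It assumes Recv-Q is the 2nd field of netstat -anu, as in the capture above; field positions can vary between net-tools versions, so verify on your own system first.

```shell
#!/bin/sh
# Extract just the Recv-Q bytes for the :514 UDP socket from netstat output.
# Assumption: Recv-Q is field 2 and the local address is field 4.
parse_recvq() {
    awk '$4 ~ /:514$/ {print $2}'
}

# Live use, sampling every 5 seconds:
#   while sleep 5; do netstat -anu | parse_recvq; done

# Sanity check against the line captured above:
echo "udp 110160 0 0.0.0.0:514 0.0.0.0:*" | parse_recvq   # prints: 110160
```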
[root@test-lab ~]# netstat -ansu
Udp:
    103235 packets received
    58248 packets to unknown port received.
    425450 packet receive errors
    144175 packets sent
Running this command repeatedly, we saw the packet receive errors increasing by 20-30 for every packet received. Strange.
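A rough way to quantify that error rate is to diff the kernel's UDP counters over an interval. The sketch below reads them from /proc/net/snmp (the same source netstat -su uses); it assumes the Udp data line's fields are InDatagrams, NoPorts, InErrors, OutDatagrams, in that order.

```shell
#!/bin/sh
# Diff UDP received/receive-error counters over an interval.
# /proc/net/snmp has a "Udp:" header line and a "Udp:" data line; we grab
# the data line (numeric 2nd field) and print InDatagrams and InErrors.
snap() {
    awk '/^Udp:/ && $2 ~ /^[0-9]+$/ {print $2, $4}' /proc/net/snmp
}

before=$(snap); sleep 5; after=$(snap)
rx0=${before% *}; err0=${before#* }
rx1=${after% *};  err1=${after#* }
echo "received: $((rx1 - rx0))  errors: $((err1 - err0))"
```

If the errors delta dwarfs the received delta, the application is not draining the socket fast enough and the kernel is discarding datagrams.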

We tried making some tweaks to the buffer size. This did not make an immediate difference; note that net.core.rmem_max is only the ceiling an application may request, so the daemon likely needs to be restarted (or net.core.rmem_default raised as well) before the change takes effect, and we were unable to restart things at the time. We'll update sysctl.conf and revisit during a maintenance window. If I recall correctly, I've seen this behavior before, where the buffer would fill close to max regardless of how large we set the rmem_max value. The default was 131071 bytes; here we bump it to roughly 8MB:
[root@test-lab ~]# sysctl net.core.rmem_max
net.core.rmem_max = 131071
[root@test-lab ~]# sysctl -w net.core.rmem_max=8192000
net.core.rmem_max = 8192000
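To make the change survive a reboot, something like the following (run as root). Adding net.core.rmem_default is our assumption: raising only the max does nothing unless the daemon explicitly asks for a bigger buffer via SO_RCVBUF.

```shell
# Persist the larger buffers across reboots (run as root).
cat >> /etc/sysctl.conf <<'EOF'
net.core.rmem_max = 8192000
net.core.rmem_default = 8192000
EOF

# Apply immediately; newly created sockets pick this up without a reboot.
sysctl -p
```

Existing sockets keep their old buffer size, so restarting syslogd (rather than rebooting the whole box) should be enough for it to pick up the new default.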
Also, someone suggested we try offloading some of the work to the hardware using UDP Fragmentation Offload. This is disabled by default in Linux. It sounded like a good idea and made sense, because some of these log messages are quite large and likely fragmented. Unfortunately, it appears our NICs or drivers do not support this feature.
[root@test-lab ~]# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off
[root@test-lab ~]# ethtool -K eth0 ufo on
Cannot set device udp large send offload settings: Operation not supported
[root@test-lab ~]#
So, the current plan is to try to increase the buffers, decrease the amount of UDP data being sent to the server, and possibly upgrade the hardware in the form of new NICs.



UPDATE:
Just wanted to add a few more details to this conversation.

During troubleshooting, we tried to figure out which hosts were generating the bulk of the UDP traffic that the socket could not keep up with. To do this, I ran the following rough command:

#tcpdump -nnn -c 1000 -i bond0 udp and port 514 | awk '{print $3}' | egrep -o "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" | sort | uniq -c | sort -nr

This command captured the next 1000 UDP syslog packets, parsed out the source IP addresses, counted the unique sources, and sorted the results in descending order (highest at the top). It turned out that some of the firewalls and some of the DNS servers in the local zone were sending tons of UDP logs. After speaking with the owners, we trimmed some of the 'noise' out of the syslogs and also switched them over to TCP. After a few of these changes, the Recv-Q fell to 0 and only rarely accumulated; it would quickly revert to zero as the application kept up. Additionally, our netstat -ansu output showed that packets were being received without errors. Neat!
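For reference, the switch to TCP is just a config change on both ends. A sketch assuming rsyslog (syntax differs for syslog-ng, and classic syslogd has no TCP support); the hostname is a placeholder:

```
# Server side (/etc/rsyslog.conf): accept TCP syslog on port 514
$ModLoad imtcp
$InputTCPServerRun 514

# Client side: forward everything over TCP (@@ = TCP, a single @ = UDP)
*.* @@loghost.example.com:514
```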

Just to be sure, we also ran #netstat -ant and #netstat -antu to make sure that the TCP recv-q wasn't filling, and there were no packets being dropped on the TCP side. It looks like we were able to resolve the issue.

