Advanced Troubleshooting with Strace

Sometimes a site performs erratically or loads slowly, and it’s not evident why. When you’ve run out of standard troubleshooting methods, it might be time to go deeper.

We need to go deeper.

One way to do that is with a tool called strace, which lets you watch the system calls a process makes to the kernel in real time.

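You can attach it to an already running process by passing it a process ID (with the -p option), or put it in front of a command to trace that command from the start.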

Quick example:

Let’s use the [shell]-e trace[/shell] option to tell strace which system calls we’re interested in. We want to see which files the command opens. We suspect that the host command will read our [shell]/etc/resolv.conf[/shell] (to find its nameservers) before querying DNS for an A record, so let’s verify that.

[shell]
$ strace -e trace=open host google.com 2>&1 | grep resolv.conf
open("/etc/resolv.conf", O_RDONLY|O_LARGEFILE) = 6
[/shell]

As we expected, it does make an attempt to open that file.

Note that strace writes its output to STDERR, so I redirected STDERR to STDOUT (2>&1) in order to grep it. I won’t go into much more detail about strace for now, but you get the idea.
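One caveat if you try this on a newer system: recent glibc versions on Linux implement open() with the openat system call, so -e trace=open may match nothing at all, and -e trace=open,openat is the safer choice. If you only want the list of files a command actually managed to open, a small filter over strace’s output does the job. This is a rough sketch that assumes strace’s default one-line-per-call output; opened_files is a hypothetical helper name, not something that ships with strace:

[shell]
# opened_files (hypothetical helper): read strace output on stdin and print
# the path of every open()/openat() call that succeeded.
opened_files() {
    # Keep open/openat lines (optionally prefixed with "[pid N] " under -f),
    # skip failed calls ("= -1 ENOENT ..."), and print the first quoted argument.
    sed -n -E '/^(\[pid +[0-9]+] )?(open|openat)\(/{ /= -1 /!s/^[^"]*"([^"]*)".*/\1/p; }'
}

# Example:
#   $ strace -f -e trace=open,openat host google.com 2>&1 | opened_files
[/shell]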

Now back to our hypothetical slow or erratic website. The first step in troubleshooting an issue is duplicating the problem. The second step is making it repeatable. The third step is isolating the problem so you can pick it apart and examine it. On a busy webserver, that last step is the hard one: you don’t know which Apache child process (PID) is serving your request, so you can’t very well isolate it.

There are some hacky workarounds for finding the Apache process ID that’s serving your HTTP request. For example, you can telnet to the server from the server itself and find the PID with lsof or netstat:

[shell]
$ telnet localhost 80
GET / HTTP/1.1
Host: slow-domain.com
[/shell]

Then open another terminal (or screen window) on the server and use netstat to find the Apache connection that pairs with your telnet session:
[shell]
$ netstat -tapn

[..snip..]
tcp6 0 0 127.0.0.1:80 127.0.0.1:40402 ESTABLISHED 20008/apache2
tcp 0 0 127.0.0.1:40402 127.0.0.1:80 ESTABLISHED 23955/telnet
[/shell]

From this we know that process 20008 is serving my telnet request, because the foreign port of the apache2 connection (40402) is the local port of the telnet connection. Then you can strace that PID and quickly send the final carriage return in your telnet session to complete the request. But this is clunky and racy (strace has to be attached before you send that carriage return, yet the request has to be finished before Apache gives up on the idle connection), and frankly it’s hard to get right.
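If you do go the telnet route, the port matching can at least be scripted instead of eyeballed. Below is a minimal awk sketch; server_pid_for_client_port is a hypothetical helper, and it assumes the column layout of netstat -tapn shown above (foreign address in column 5, state in column 6, PID/program in column 7):

[shell]
# server_pid_for_client_port (hypothetical helper): read `netstat -tapn`
# output on stdin and print the PID of the process whose connection has our
# telnet client's local port as its foreign port.
server_pid_for_client_port() {
    awk -v port="$1" '
        $6 == "ESTABLISHED" {
            # Foreign address is column 5; the port follows the last colon.
            n = split($5, addr, ":")
            if (addr[n] == port) {
                # Column 7 looks like "20008/apache2"; keep the PID part.
                split($7, pidprog, "/")
                print pidprog[1]
            }
        }'
}

# Example, using the telnet client's local port (40402 in the output above):
#   $ netstat -tapn | server_pid_for_client_port 40402
#   $ strace -p "$(netstat -tapn | server_pid_for_client_port 40402)" -f -s 2048
[/shell]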

But there is a better way: launch another instance of Apache on different ports, say 81 and 444 (for HTTPS). Set MaxClients to 1 so that a single child process handles every request, then add an iptables rule so that only your remote IP can reach those destination ports.
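The firewall part could look something like the sketch below. It only prints the iptables commands so you can review them before running anything; alt_port_rules is a hypothetical helper, 203.0.113.10 is a placeholder for your own IP, and it assumes iptables’ multiport match is available:

[shell]
# alt_port_rules (hypothetical helper): print iptables commands that allow
# only one source IP to reach the alternate Apache ports. Nothing is executed.
alt_port_rules() (
    allowed_ip=$1
    ports=${2:-81,444}
    # -I inserts at the top of the INPUT chain, so the DROP is printed first
    # and the ACCEPT second; once both run, the ACCEPT sits above the DROP.
    echo "iptables -I INPUT -p tcp -m multiport --dports $ports -j DROP"
    echo "iptables -I INPUT -p tcp -s $allowed_ip -m multiport --dports $ports -j ACCEPT"
)

# Example: review the output, then run it as root.
#   $ alt_port_rules 203.0.113.10
#   $ alt_port_rules 203.0.113.10 | sh
[/shell]

When you’re done, the same rules can be removed by running each command again with -D in place of -I.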

Here’s an example of how you can do this on a cPanel server. You may not need to copy everything like I did; I just wanted to be sure I had an exact replica running on the alternate ports. If your Apache directory is large, you might want to exclude big log files and such.

Clone the Apache directory in full (binaries, conf, everything):
[shell]
# -a (archive) copies recursively and keeps permissions, ownership and symlinks intact
$ cp -a /usr/local/apache /usr/local/apache-tmp
[/shell]
Change the HTTP and HTTPS ports so our copy can run without affecting the regular Apache:
[shell]
$ cd /usr/local/apache-tmp
$ find . -type f -exec sed -i 's/:80/:81/g' {} \;
$ find . -type f -exec sed -i 's/:443/:444/g' {} \;
[/shell]
Allow only one client (MaxClients 1), so we know which Apache PID is serving us when we hit the site:
[shell]
$ find . -type f -exec sed -i 's/MaxClients.*/MaxClients 1/g' {} \;
[/shell]
Change all absolute path references from the normal Apache directory to our cloned one:
[shell]
$ find . -type f -exec sed -i 's|/usr/local/apache|/usr/local/apache-tmp|g' {} \;
[/shell]
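One thing to watch with the find/sed approach: find . -type f also hands the cloned binaries and modules to sed -i, and rewriting a string inside a compiled file (especially a longer one, as in the path substitution) can corrupt it if a match happens to be there. A more cautious variant only touches files that grep considers text. This is a sketch; rewrite_text_files is a hypothetical helper, and it assumes GNU sed (for sed -i without a backup suffix) and a grep that supports -I:

[shell]
# rewrite_text_files (hypothetical helper): apply a sed expression in place to
# every text file under DIR that contains PATTERN. grep -I treats binary files
# as non-matching, so executables and .so modules are left alone.
rewrite_text_files() (
    dir=$1
    pattern=$2
    expr=$3
    # -r recurse, -l list file names only, -I skip binaries, -F fixed string
    grep -rlIF -- "$pattern" "$dir" | while IFS= read -r file; do
        sed -i "$expr" "$file"
    done
)

# Example: the same rewrites as above, restricted to text files.
#   $ rewrite_text_files /usr/local/apache-tmp ':80' 's/:80/:81/g'
#   $ rewrite_text_files /usr/local/apache-tmp ':443' 's/:443/:444/g'
#   $ rewrite_text_files /usr/local/apache-tmp 'MaxClients' 's/MaxClients.*/MaxClients 1/g'
#   $ rewrite_text_files /usr/local/apache-tmp '/usr/local/apache' 's|/usr/local/apache|/usr/local/apache-tmp|g'
[/shell]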
Now we can start our cloned Apache on the alternate ports 81 and 444, with MaxClients set to 1. You should then be able to reach every site on the server via the alternate ports.
[shell]
$ httpd -d /usr/local/apache-tmp/ -f /usr/local/apache-tmp/conf/httpd.conf
[/shell]
That launches the parent (root) httpd process with one child process, as expected.
Now find the child PID; try:
[shell]
$ ps auxf | grep apache-tmp
[/shell]
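Picking the child out of the ps output by eye works fine. If you’d rather script it, the idea is that the child is the apache-tmp process whose parent is also an apache-tmp process. A sketch, assuming the children keep the parent’s command line (as they typically do in the ps auxf tree); apache_tmp_children is a hypothetical helper that reads the output of ps -eo pid=,ppid=,args=:

[shell]
# apache_tmp_children (hypothetical helper): read "PID PPID ARGS" lines, as
# printed by `ps -eo pid=,ppid=,args=`, and print the PIDs of apache-tmp
# processes whose parent is also an apache-tmp process (the worker children).
apache_tmp_children() {
    awk '
        /apache-tmp/ { is_tmp[$1] = 1; parent[$1] = $2; order[++n] = $1 }
        END {
            for (i = 1; i <= n; i++) {
                pid = order[i]
                ppid = parent[pid]
                if (ppid in is_tmp) print pid
            }
        }'
}

# Example:
#   $ ps -eo pid=,ppid=,args= | apache_tmp_children
#   $ strace -p "$(ps -eo pid=,ppid=,args= | apache_tmp_children)" -f -s 2048
[/shell]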
Now attach strace to that one and only Apache child process, replacing PID_HERE with its PID:
[shell]
$ strace -p PID_HERE -f -s 2048
[/shell]
The -f option tells strace to follow child processes (and threads) as well.
The -s option sets the maximum length of strings strace prints for each call; the default is only 32 characters. 2048 might be overkill, so feel free to adjust it.
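For a slow site specifically, strace’s -T option is worth adding: it appends the time spent in each system call, in seconds, as something like <0.000123> at the end of the line, which makes long waits (on a database or remote API socket, for example) stand out. Here is a sketch for pulling out the slowest calls; slowest_calls is a hypothetical helper, and /tmp/apache.trace is just an example path for strace’s -o output file:

[shell]
# slowest_calls (hypothetical helper): read strace -T output on stdin and
# print the N slowest calls (default 10), each prefixed with its duration.
slowest_calls() (
    n=${1:-10}
    # Copy the trailing <seconds> to the front of each line, then sort on it.
    sed -n -E 's/^(.*<([0-9]+\.[0-9]+)>)$/\2 \1/p' |
        LC_ALL=C sort -rn -k1,1 |
        head -n "$n"
)

# Example:
#   $ strace -p PID_HERE -f -s 2048 -T -o /tmp/apache.trace
#   (make the request, then stop strace with Ctrl-C)
#   $ slowest_calls 20 < /tmp/apache.trace
[/shell]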

Then make the HTTP request to port 81:
[shell]
$ curl slowsite.com:81/badcode.php
[/shell]

This is definitely a drastic troubleshooting method, but it’s great for those times when you hit a brick wall diagnosing a slow-loading or erratically behaving site and feel compelled to find the issue.

Note: cPanel changes its directory structure with updates from time to time. I did this a few months ago on a cPanel 11.40 build, I believe. YMMV; use this tactic with caution.
