Jan 09 2012

Distributed MySQL Sleuthing on the Wire

Category: Databases,Real-time Web,SSH,Systemsjgoulah @ 8:52 AM

Intro

Oftentimes you need to know what MySQL is doing right now and furthermore if you are handling heavy traffic you probably have multiple instances of it running across many nodes. I’m going to start by showing how to take a tcpdump capture on one node, a few ways to analyze that, and then go into how to take a distributed capture across many nodes for aggregate analysis.

Taking the Capture

The first thing you need to do is to take a capture of the interesting packets. You can either do this on the MySQL server or on the hosts talking to it. According to this percona post this command is the best way to capture mysql traffic on the eth0 interface and write it into mycapture.cap for later analysis:

% tcpdump -i eth0 -w mycapture.cap -s 0 "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
47542 packets captured
47703 packets received by filter
60 packets dropped by kernel

Analyzing the Capture

The next step is to take a look at your captured data. One way to do this is with tshark, which is the command line part of wireshark. You can do yum install wireshark or similar to install it. Usually you want to do this on a different host than the one taking traffic since it can be memory and CPU intensive.

You can then use it to reconstruct the mysql packets like so:

% tshark -d tcp.port==3306,mysql -T fields -R mysql.query -e frame.time -e ip.src -e ip.dst -e mysql.query -r mycapture.cap

This will give you the time, source IP, destination IP, and query but this is still really raw output. Its a nice start but we can do better. Percona has released the Percona Toolkit which includes some really nice command line tools (including what used to be in Maatkit).

The one we’re interested in here is pt-query-digest

It has tons of options and you should read the documentation, but here’s a few I’ve used recently.

Lets say you want to get the top tables queried from your tcpdump

% tcpdump -r mycapture.cap -n -x -q -tttt | pt-query-digest --type tcpdump --group-by tables --order-by Query_time:cnt \
 --report-format profile --limit 5
reading from file mycapture.cap, link-type EN10MB (Ethernet)

# Profile
# Rank Query ID Response time Calls R/Call Apdx V/M   Item
# ==== ======== ============= ===== ====== ==== ===== ====================
#    1 0x        0.3140  6.1%   674 0.0005 1.00  0.00 shard.images
#    2 0x        0.8840 17.1%   499 0.0018 1.00  0.03 shard.activity
#    3 0x        0.1575  3.1%   266 0.0006 1.00  0.00 shard.listing_images
#    4 0x        0.1680  3.3%   265 0.0006 1.00  0.00 shard.connection_edges_reverse
#    5 0x        0.0598  1.2%   254 0.0002 1.00  0.00 shard.listing_translations
# MISC 0xMISC    3.5771 69.3%  3534 0.0010   NS   0.0 <86 ITEMS>

Note the tcpdump options I used this time, which the tool requires to work properly when passing –type tcpdump. I also grouped by tables (as opposed to full queries) and ordered by the count (the Calls column). It will stop at your –limit and group the rest into MISC so be aware of that.

You can remove the –order-by to sort by response time, which is the default sort order, or provide other attributes to sort on. We can also change the –report-format, for example to header:

% tcpdump -r mycapture.cap -n -x -q -tttt | pt-query-digest --type tcpdump --group-by tables --report-format header
reading from file mycapture.cap, link-type EN10MB (Ethernet)

# Overall: 5.49k total, 91 unique, 321.13 QPS, 0.30x concurrency _________
# Time range: 2012-01-08 15:52:05.814608 to 15:52:22.916873
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time             5s     3us   114ms   939us     2ms     3ms   348us
# Rows affecte         316       0      13    0.06    0.99    0.29       0
# Query size         3.64M      18   5.65k  694.98   1.09k  386.68  592.07
# Warning coun           0       0       0       0       0       0       0
# Boolean:
# No index use   0% yes,  99% no

If you set the –report-format to query_report you will get gobs of verbose information that you can dive into and you can use the –filter option to do things like getting slow queries:

% tcpdump -r mycapture.cap -n -x -q -tttt | \
  pt-query-digest --type tcpdump --filter '($event->{No_index_used} eq "Yes" || $event->{No_good_index_used} eq "Yes")'

Distributed Capture

Now that we’ve taken a look at capturing and analyzing packets from one host, its time to dive into looking at our results across the cluster. The main trick is that tcpdump provides no option to stop capturing – you have to explicitly kill it. Otherwise we’ll just use dsh to send our commands out. We’ll assume you have a user that can hop around in a password-less fashion using ssh keys – setting that up is well outside the scope of this article but there’s plenty of info out there on how to do that.

There’s a few ways you can let a process run on a “timeout” but I’m assuming we don’t have any script written or tools like bash timeout or the one distributed in coreutils available.

So we’re going off the premise that you will background the process and kill it after a sleep by grabbing its pid:

( /path/to/command with options ) & sleep 5 ; kill $!

Simple enough, except we’ll want to capture the output on each host, so we need to ssh the output back over to the target using a pipe to grab the stdout. This means that $! will return the pid of our ssh command instead of our tcpdump command. We end up having to do a little trick to kill the right process, since the capture won’t be readable if we kill ssh command that is writing the output. We’ll need to kill tcpdump and to do that we can look at the parent pid of the ssh process, ask pkill (similar to pgrep) for all of the processes that have this parent, and finally kill the oldest one, which ends up being our tcpdump process.

Then end result looks like this if I were to run it across two machines:

% dsh -c -m web1000,web1001 \
   'sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | \
   ssh dshhost "cat - > ~/captures/$(hostname -a).cap" & sleep 10 ; \
   sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")'

So this issues a dsh to two of our hosts (you can make a dsh group with 100 or 1000 hosts though) and runs the command concurrently on each (-c). We issue our tcpdump on each target machine and send the output to stdout for ssh to then cat back to a directory on the source machine that issued the dsh. This way we have all of our captures in one directory with each file named with the target name of each host the tcpdump was run. The sleep is how long the dump is going to run for before we then kill off the tcpdump.

The last piece of the puzzle is to get these all into one file and we can use the mergecap tool for this, which is also part of wireshark:

% /usr/sbin/mergecap -F libpcap -w output.cap *.cap

And then we can analyze it like we did above.

Further Reading

References

http://www.mysqlperformanceblog.com/2011/04/18/how-to-use-tcpdump-on-very-busy-hosts

http://stackoverflow.com/questions/687948/timeout-a-command-in-bash-without-unnecessary-delay

http://www.xaprb.com/blog/2009/08/18/how-to-find-un-indexed-queries-in-mysql-without-using-the-log/

Breaking the distributed command down further

Just to clarify this command a bit more, particularly how the kill part works since that was the trickiest part for me to figure out.

When we run this

$ dsh -c -m web1000,web1001 \
   'sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | \
   ssh dshhost "cat - > ~/captures/$(hostname -a).cap" & sleep 10 ; \
   sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")'

on the server the process list looks something like

user     12505 12504  0 03:12 ?        00:00:00 bash -c sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | ssh myhost.myserver.com "cat - > /home/etsy/captures/$(hostname -a).cap" & sleep 5 ; sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")
pcap     12506 12505  1 03:12 ?        00:00:00 /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2
user     12507 12505  0 03:12 ?        00:00:00 ssh myhost.myserver.com cat - > ~/captures/web1001.cap

So $! is going to return the pid of the ssh process, 12507. We use awk to find the process matching that, and then print the parent pid out, which is then passed to the -P arg of pkill. If you use pgrep to look at this without the -o you’d get a list of the children of 12505, which are 12506 and 12507. The oldest child is the tcpdump command and so adding -o kills that guy off.

So if we were only running the command on one host we could use something much simpler

ssh dbhost01 '(sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 port 3306) & sleep 10; sudo kill $!' | cat - > output.cap

Tags: , , , , ,


Mar 15 2010

Node.js, Websockets, and the Twitter Gardenhose

Category: Real-time Webjgoulah @ 3:08 PM

Introduction

In this article I’m going to demonstrate how to use the Twitter Streaming API to send a stream of status updates for real time display in a web browser using Web Sockets. We’ll implement a backend server module that streams the events over a web socket using the node.websocket.js framework, which provides a simple realtime HTTP server that implements the web socket protocol. Node.websocket.js is built on top of Node.js which provides an Asynchronous API using callbacks. Node.js is a server side javascript library that enforces an event driven style of programming which allows you to develop non-blocking code easily. This in turn allows you to write simple servers that are very CPU and Memory efficient because you don’t have multiple threads taking up shared resources. Node.js is implemented using Google’s V8 javascript engine and also the CommonJS specification which is defining a standard library for server side javascript.

Getting Started

First things first, you need node.js, which you can get from github. Install in the usual way

$ git clone git://github.com/ry/node.git
$ cd node
$ ./configure
$ make
$ sudo make install

Then we’ll need node.websocket.js which you can also get from github

$ git clone git://github.com/Guille/node.websocket.js.git

This is basically an experimental implementation of the web socket API. You can create simple server side modules and then a client side implementation that opens up a web socket to the server in which you can then exchange data in both directions. There are a few examples included that you can take a look at including a simple echo server and a chat server. Just fire up the server like so

$ node runserver.js

and then open up one of the html files in the test/websocket/ directory in your browser. One catch though, you’ll need a current browser such as Google Chrome. I’d recommend it anyway, as browsing the web with it does run a bit smoother.

You’ll also need curl, which is pretty common on any linux box these days and can be installed through your package manager.

Setting Up the Server

The way we’re going to implement the server is to use the curl command to pull from the twitter stream into a file. Twitter gives you a bunch of JSON objects back which you can then parse and display. We’ll use this file in a moment to send the data over the web socket to the browser. There is some documentation on what you can do with the API but we’ll keep it simple and search for any tweets with ‘nyc’ in them

$ curl -dtrack=nyc http://stream.twitter.com/1/statuses/filter.json  -uUSERNAME:PASSWORD > sample.json

Now its time to write some server side javascript. In the node.websocket.js checkout there is a modules/ directory. We’ll create a new module called gardenhose.js. So the full path is node.websocket.js/modules/gardenhose.js

In a nutshell this is waiting for the client to establish a connection, and then it creates a child process that tails our file from the curl command above. Anytime the file is written to the “output” listener is invoked, which runs our callback to parse the JSON into objects that we can then use to send a string back to the client with some readable information from the stream.

Lets break it down just a bit more in case you are not familiar with Node. First we are requiring the system and filesystem modules from Node. Now, node.websocket.js basically just uses the node API to implement a server in the websocket.js file. It looks at the request header to see which module to instantiate and then invokes your onData() method when a client sends over data.

Therefore the onData method is the one we need to implement in our module above. We’ll use the process object in Node to create a child process that emits an event called “output” each time the child sends data to stdout. So the addListener call sets up a callback that will be invoked when our file receives more data from the twitter stream. That data comes in the form of JSON objects, one per line. So we split on the lines to create an array of JSON objects, and loop through them. Each time through the loop we’re sending this data back to the client, which is the web browser.

Just make sure the file you pass in is the correct path to the file you are outputting to from the curl command in the monitor_file() function. Then to run the node.websocket.js server you can just invoke it like so

$ node runserver.js

However if you are running the server from another host, you may want to listen on more than just the default of localhost

$ node runserver.js --host=0.0.0.0

Setting Up the Client

The goal here is to get the realtime tweets pumping through our web browser so our client will just be a web page with a little bit of web socket javascript.

This is probably a bit more straightforward. We’re just implementing a few of the functions from the Websocket interface. First we’re instantiating the WebSocket class with the hostname and port that we’re running the server on. The gardenhose in the path is to tell our server that we want to run the gardenhose.js module that we wrote above. The onopen() function is invoked when the socket is opened, and we send over the word “start” which if you recall from above understands the client has connected and to start the child process that runs tail on our file. The onmessage() function is invoked anytime the server is sending data over the web socket, which is the information we want to show on the page, so we append it to the HTML of our hose div. If the server closes the socket then onclose() is invoked, and we display that on the page.

Conclusion

We’ve written a server side module using Node.js and the node.websocket.js framework that will send tweets over a web socket connection. We have also looked at the WebSocket API and learned how to implement some of the functions defined by its interface. Of course, you could use JQuery or your favorite javascript libraries to enhance how this looks to the user, but the basics are all here in how the communication of a real time display can work with web sockets.

References

Async I/O – http://en.wikipedia.org/wiki/Asynchronous_I/O

CommonJS – http://www.commonjs.org/ and http://wiki.commonjs.org/wiki/CommonJS

V8 – http://code.google.com/p/v8

Node.JS – http://nodejs.org

Node.Websocket.JS – http://github.com/guille/node.websocket.js

Twitter Streaming API (Gardenhose) – http://apiwiki.twitter.com/Streaming-API-Documentation

Websocket API – http://dev.w3.org/html5/websockets

Tags: , , , , , , , , , , ,




  • new england patriots 98.5
  • hp support 530
  • zara phillips kids
  • la ink ink
  • tubing
  • new england patriots 1996 roster
  • connecticut 30 news
  • princes
  • neptune
  • bea 460 bosch
  • juliana
  • mtv rivals
  • bea 00037
  • baer
  • hong
  • search lsu.edu
  • transfers
  • cspan journal
  • dist 95
  • getaway
  • chad ochocinco age
  • bea karp
  • zara phillips husband
  • randy moss jail
  • connecticut education
  • randy moss legal issues
  • search engines for jobs
  • waffle
  • fundamentals
  • zara phillips wedding date
  • chicago bears 17 lisa lampanelli
  • bea goldfishberg
  • search engines no follow
  • zara phillips wedding hat
  • breaks
  • vince young yahoo stats
  • chicago bears tickets
  • search protocol host
  • joshua
  • xanadu bengals
  • connecticut football
  • plated
  • rambler
  • search 2.0
  • bea 71 series staples
  • chad ochocinco ultimate catch cast
  • tea party birthday
  • randy moss college
  • mtv executivesmtv fantasy factory
  • chicago bears zip hoodie
  • bengals for adoption
  • hp support greece
  • bengals tryouts
  • hp support englandhp support forum
  • bengals forum
  • lists
  • bengals arrests
  • mesa
  • dis windsor wi
  • randy moss future
  • bea fox
  • battleship yamato wreck
  • battleship aurora
  • chicago bears rumors 2011
  • hp support id
  • painted
  • chad ochocinco quotes video
  • vince young 6
  • chicago bears football club
  • vince young redskins
  • conditions
  • chad ochocinco nascar
  • battleship classes
  • frequency
  • havelock
  • new england patriots wiki
  • connecticut 104.1
  • dis systems
  • connecticut lakes
  • chicago bears garter
  • beagle
  • skate
  • trusted
  • chad ochocinco quickstep
  • tea party nj
  • dangerous
  • hp support helpline
  • cspan government shutdown
  • dually
  • c span 4 to 5
  • bend
  • mtv jams
  • search engines rankings 2011
  • zara phillips facebookzara phillips gossip
  • new england patriots kim kardashian
  • la ink book an appointment
  • battleship layout
  • tea party agenda
  • new england patriots jake locker
  • randy moss mix
  • search with image
  • units
  • la ink map
  • input
  • dist 91
  • $200
  • mtv 2 schedule
  • hp support englandhp support forum
  • new england patriots 65
  • searchbugsearch engines
  • la ink season 6
  • new england patriots needs
  • connecticut 97.7connecticut attorney general
  • chad ochocinco yesterday
  • randy moss vikings 2011
  • dis 2012 conference
  • sperry
  • lease
  • chad ochocinco 15
  • foster
  • cameras
  • chicago bears gifts
  • zara phillips baby
  • la ink youtube pixie
  • connecticut 5 star resorts
  • mtv 5 cover
  • connecticut 5th district
  • search comcast net
  • gelatin
  • zara phillips tongue
  • teck
  • la ink cast
  • resistivity
  • bangles eternal flame mp3bengals forum
  • buss
  • protector
  • la ink season 5
  • vince young 3rd 30
  • connecticut department of labor
  • new england patriots helmet
  • connecticut limo
  • discjuggler
  • chicago bears pictures
  • mtv oddities
  • hp support error 1005
  • bea taylor
  • search engines zuula
  • cspan facebook
  • mtv music awards
  • vince young injury
  • search engines 9
  • battleship hacked
  • gregg olsen books
  • bengals job fair
  • dis poem
  • bengals 09 record
  • search 4
  • timbaland
  • freida pinto can't act
  • hp support hard drive replacement
  • login
  • dis x
  • marathon
  • dis pater
  • search chuck norris
  • nubian
  • vince young uncle rico
  • connecticut transit
  • chicago bears 4th phase
  • bea spells a lot
  • tea party obama
  • di's hallmark
  • battleship aurora
  • albuterol
  • c span shelby foote
  • chad ochocinco to patriots
  • seized
  • search xml file
  • vince young football camp
  • vince young released
  • chad ochocinco wedding date
  • tea party lies
  • bea 2011 map
  • connecticut renaissance faire
  • zara phillips and the queen
  • zara phillips dating
  • fond
  • lavigne
  • ronda
  • chad ochocinco free agent
  • vince young rumors
  • cspan presidents
  • spares
  • search operatorssearch people