Jan 09 2012

Distributed MySQL Sleuthing on the Wire

Category: Databases, Real-time Web, SSH, Systems | jgoulah @ 8:52 AM

Intro

Oftentimes you need to know what MySQL is doing right now, and if you are handling heavy traffic you probably have multiple instances of it running across many nodes. I’m going to start by showing how to take a tcpdump capture on one node and a few ways to analyze it, and then go into how to take a distributed capture across many nodes for aggregate analysis.

Taking the Capture

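The first thing you need to do is to take a capture of the interesting packets. You can either do this on the MySQL server or on the hosts talking to it. According to this Percona post, the command below is a good way to capture MySQL traffic on a busy host. It listens on the eth0 interface, keeps only packets whose source and destination ports both have the value 2 in their low three bits (3306 does, so this samples roughly one in eight client connections), and writes them into mycapture.cap for later analysis: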

% tcpdump -i eth0 -w mycapture.cap -s 0 "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
47542 packets captured
47703 packets received by filter
60 packets dropped by kernel
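
The bit tests in that filter are easier to see with a little arithmetic. As a rough sketch in bash (the function name port_matches_sample and the client ports 40002 and 40003 are made up for illustration), this checks which ports would pass the "& 7 == 2" test that the filter applies to the low byte of the source and destination ports:

```shell
# Succeeds when a TCP port would pass the "& 7 == 2" test in the filter.
port_matches_sample() {
    local port=$1
    # tcp[1] and tcp[3] are the low bytes of the source and destination ports.
    local low_byte=$(( port & 0xFF ))
    (( (low_byte & 7) == 2 ))
}

# 3306 is the MySQL port; the other two are hypothetical client ports.
for port in 3306 40002 40003; do
    if port_matches_sample "$port"; then
        echo "$port: captured"
    else
        echo "$port: skipped"
    fi
done
```

Because 3306 itself passes the test, a packet is kept whenever the client's port passes too, which is where the roughly one-in-eight sampling comes from.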

Analyzing the Capture

The next step is to take a look at your captured data. One way to do this is with tshark, which is the command line part of wireshark. You can do yum install wireshark or similar to install it. Usually you want to do this on a different host than the one taking traffic since it can be memory and CPU intensive.

You can then use it to reconstruct the mysql packets like so:

% tshark -d tcp.port==3306,mysql -T fields -R mysql.query -e frame.time -e ip.src -e ip.dst -e mysql.query -r mycapture.cap

This will give you the time, source IP, destination IP, and query, but this is still really raw output. It’s a nice start but we can do better. Percona has released the Percona Toolkit, which includes some really nice command line tools (including what used to be in Maatkit).

The one we’re interested in here is pt-query-digest.

It has tons of options and you should read the documentation, but here are a few I’ve used recently.

Let’s say you want to get the top tables queried from your tcpdump capture:

% tcpdump -r mycapture.cap -n -x -q -tttt | pt-query-digest --type tcpdump --group-by tables --order-by Query_time:cnt \
 --report-format profile --limit 5
reading from file mycapture.cap, link-type EN10MB (Ethernet)

# Profile
# Rank Query ID Response time Calls R/Call Apdx V/M   Item
# ==== ======== ============= ===== ====== ==== ===== ====================
#    1 0x        0.3140  6.1%   674 0.0005 1.00  0.00 shard.images
#    2 0x        0.8840 17.1%   499 0.0018 1.00  0.03 shard.activity
#    3 0x        0.1575  3.1%   266 0.0006 1.00  0.00 shard.listing_images
#    4 0x        0.1680  3.3%   265 0.0006 1.00  0.00 shard.connection_edges_reverse
#    5 0x        0.0598  1.2%   254 0.0002 1.00  0.00 shard.listing_translations
# MISC 0xMISC    3.5771 69.3%  3534 0.0010   NS   0.0 <86 ITEMS>

Note the tcpdump options I used this time, which the tool requires to work properly when passing --type tcpdump. I also grouped by tables (as opposed to full queries) and ordered by the count (the Calls column). It will stop at your --limit and group the rest into MISC, so be aware of that.

You can remove the --order-by to sort by response time, which is the default sort order, or provide other attributes to sort on. We can also change the --report-format, for example to header:

% tcpdump -r mycapture.cap -n -x -q -tttt | pt-query-digest --type tcpdump --group-by tables --report-format header 
reading from file mycapture.cap, link-type EN10MB (Ethernet)

# Overall: 5.49k total, 91 unique, 321.13 QPS, 0.30x concurrency _________
# Time range: 2012-01-08 15:52:05.814608 to 15:52:22.916873
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time             5s     3us   114ms   939us     2ms     3ms   348us
# Rows affecte         316       0      13    0.06    0.99    0.29       0
# Query size         3.64M      18   5.65k  694.98   1.09k  386.68  592.07
# Warning coun           0       0       0       0       0       0       0
# Boolean:
# No index use   0% yes,  99% no

If you set the --report-format to query_report you will get gobs of verbose information that you can dive into, and you can use the --filter option to do things like finding queries that don’t use an index (or don’t use a good one):

% tcpdump -r mycapture.cap -n -x -q -tttt | \
  pt-query-digest --type tcpdump --filter '($event->{No_index_used} eq "Yes" || $event->{No_good_index_used} eq "Yes")'

Distributed Capture

Now that we’ve taken a look at capturing and analyzing packets from one host, it’s time to dive into looking at our results across the cluster. The main trick is that tcpdump provides no simple option to stop capturing after a set amount of time (-c stops after a packet count, not a duration), so you have to explicitly kill it. Otherwise we’ll just use dsh to send our commands out. We’ll assume you have a user that can hop around in a password-less fashion using ssh keys. Setting that up is well outside the scope of this article, but there’s plenty of info out there on how to do that.

There are a few ways you can let a process run on a “timeout”, but I’m assuming we don’t have any script written, or tools like bash timeout or the one distributed in coreutils, available.

So we’re going off the premise that you will background the process and kill it after a sleep by grabbing its pid:

( /path/to/command with options ) & sleep 5 ; kill $!
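
Wrapped up as a shell function, that pattern might look like the sketch below. The name run_for is hypothetical and the code assumes bash; it simply backgrounds whatever command it is given, sleeps, and then kills the pid it got from $!:

```shell
# Run a command in the background and kill it after a number of seconds.
# Usage: run_for SECONDS COMMAND [ARGS...]
run_for() {
    local seconds=$1
    shift
    "$@" &                           # start the command in the background
    local pid=$!                     # $! is the pid of the job just started
    sleep "$seconds"
    kill "$pid" 2>/dev/null || true  # ignore the error if it already exited
    wait "$pid" 2>/dev/null || true  # reap the job so no zombie is left behind
}

# Example: give a long-running command two seconds, then stop it.
run_for 2 sleep 60
```

Saving $! into a variable right away matters, because any later background job would overwrite it.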

Simple enough, except we’ll want to capture the output on each host, so we need to ssh the output back over to the host that issued the dsh, using a pipe to grab the stdout. This means that $! will return the pid of our ssh command instead of our tcpdump command. We end up having to do a little trick to kill the right process, since the capture won’t be readable if we kill the ssh command that is writing the output. We’ll need to kill tcpdump, and to do that we can look up the parent pid of the ssh process, ask pkill (similar to pgrep) for all of the processes that have this parent, and finally kill the oldest one, which ends up being our tcpdump process.

The end result looks like this if I were to run it across two machines:

% dsh -c -m web1000,web1001 \
   'sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | \
   ssh dshhost "cat - > ~/captures/$(hostname -a).cap" & sleep 10 ; \
   sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")'

So this issues a dsh to two of our hosts (you can make a dsh group with 100 or 1000 hosts though) and runs the command concurrently on each (-c). We issue our tcpdump on each target machine and send the output to stdout for ssh to then cat back to a directory on the source machine that issued the dsh. This way we have all of our captures in one directory, with each file named after the host the tcpdump was run on. The sleep is how long the dump is going to run for before we then kill off the tcpdump.

The last piece of the puzzle is to get these all into one file and we can use the mergecap tool for this, which is also part of wireshark:

% /usr/sbin/mergecap -F libpcap -w output.cap *.cap
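
Before merging, it may be worth checking that every host actually sent something back. A minimal sketch, assuming the captures landed in ~/captures as in the dsh command above (the function name nonempty_captures is made up for this example):

```shell
# Print the .cap files in a directory that are not zero bytes long.
nonempty_captures() {
    local dir=$1 f
    for f in "$dir"/*.cap; do
        # -s is true only when the file exists and has a size greater than zero
        [ -s "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

nonempty_captures ~/captures
```

This only catches zero-length files; a capture that holds a file header but no packets would still be listed. Any host missing from the output is one whose capture probably needs to be rerun.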

And then we can analyze it like we did above.

Further Reading

References

http://www.mysqlperformanceblog.com/2011/04/18/how-to-use-tcpdump-on-very-busy-hosts

http://stackoverflow.com/questions/687948/timeout-a-command-in-bash-without-unnecessary-delay

http://www.xaprb.com/blog/2009/08/18/how-to-find-un-indexed-queries-in-mysql-without-using-the-log/

Breaking the distributed command down further

To clarify this command a bit more, here is how the kill part works, since that was the trickiest part for me to figure out.

When we run this

$ dsh -c -m web1000,web1001 \
   'sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | \
   ssh dshhost "cat - > ~/captures/$(hostname -a).cap" & sleep 10 ; \
   sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")'

on the server the process list looks something like

user     12505 12504  0 03:12 ?        00:00:00 bash -c sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | ssh myhost.myserver.com "cat - > /home/etsy/captures/$(hostname -a).cap" & sleep 5 ; sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")
pcap     12506 12505  1 03:12 ?        00:00:00 /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2
user     12507 12505  0 03:12 ?        00:00:00 ssh myhost.myserver.com cat - > ~/captures/web1001.cap

So $! is going to return the pid of the ssh process, 12507. We use awk to find the process matching that, and then print the parent pid out, which is then passed to the -P arg of pkill. If you use pgrep to look at this without the -o you’d get a list of the children of 12505, which are 12506 and 12507. The oldest child is the tcpdump command and so adding -o kills that guy off.
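
The awk lookup is the least obvious step, so here it is in isolation. This is only a sketch: parent_of is a hypothetical helper, the sample rows are trimmed from the process list above, and it uses an exact match on the pid column instead of the \< \> word-boundary regex, which is a GNU awk extension.

```shell
# Read `ps -ef`-style rows on stdin and print the parent pid (column 3)
# of the row whose pid (column 2) equals the argument.
parent_of() {
    awk -v pid="$1" '$2 == pid { print $3 }'
}

# Sample rows trimmed from the listing above: UID PID PPID COMMAND...
sample_ps='user 12505 12504 bash -c ...
pcap 12506 12505 /usr/sbin/tcpdump ...
user 12507 12505 ssh myhost.myserver.com ...'

printf '%s\n' "$sample_ps" | parent_of 12507   # prints 12505
```

On a live host the input would come from ps -ef itself, and the printed pid is what gets handed to pkill -o -P.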

So if we were only running the command on one host we could use something much simpler

ssh dbhost01 '(sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 port 3306) & sleep 10; sudo kill $!' | cat - > output.cap



Oct 31 2009

Setting up Gitosis

Category: Version Control | jgoulah @ 4:33 PM

Overview

This article is part one of a two part series that covers setting up a hosting server using gitosis for your central repository, and in the next article, taking an existing SVN repository and running the appropriate scripts and commands necessary to migrate it into something git can work with.

So this article is about how to set up and manage a git repository. There are some great services out there that can do this for you, but why pay money for something you can easily do for free? This article shows how to set up and manage a secure and private git repository that people can use as a central sharing point.

Setting Up Gitosis

Gitosis is a tool for hosting git repositories. Its common usage is for a central repository that other developers can push changes to for sharing.

First clone the gitosis repository and run the basic python install. You just need the python setuptools package

sudo apt-get install python-setuptools

And then you can easily install it

git clone git://eagain.net/gitosis.git
cd gitosis
sudo python setup.py install

Next you need to create a user that will own the repositories you want to manage. You can put its home directory wherever you want, but in this example we’ll put it in the standard /home location.

sudo adduser \
    --system \
    --shell /bin/sh \
    --gecos 'git version control' \
    --group \
    --disabled-password \
    --home /home/git \
    git

Then you must create an ssh public key (or use your existing one) for your first repository user. We’ll use an init command to copy it to the server and load it. If you don’t have a public key you can create one with ssh-keygen like so

ssh-keygen -t dsa

Then run gitosis-init, which is only needed the first time and loads up your user's key. It goes like this

sudo -H -u git gitosis-init < ~/.ssh/id_dsa.pub

Here it doesn't hurt to make sure your post-update hook has execute permissions.

sudo chmod 755 /home/git/repositories/gitosis-admin.git/hooks/post-update

Now you can clone the gitosis-admin repository, which is used to manage our repository permissions.

git clone git@YOUR_SERVER_HOSTNAME:gitosis-admin.git
cd gitosis-admin

Now you can see you have a gitosis.conf file and a keydir directory

$ ls -l
total 8
-rw-r--r-- 1 jgoulah mygroup   83 2009-10-31 20:44 gitosis.conf
drwxr-xr-x 2 jgoulah mygroup 4096 2009-10-31 20:44 keydir

The gitosis.conf file holds group and permission information for your repositories, and the keydir folder holds your public keys.

If I look in there I see my public key was imported from our earlier gitosis-init command

$ ls -l keydir/
total 4
-rw-r--r-- 1 jgoulah mygroup 603 2009-10-31 20:44 jgoulah.pub

So open up gitosis.conf and you should already see you have an entry for the gitosis-admin repository that we just cloned. The gitosis-init command above set up the access for us. From now on we can just crack open gitosis.conf, edit the permissions, commit, and push back to our central repository.

If I wanted to create a new project for a repository called pizza_maker it would look something like this.

[group myteam]
members = jgoulah
writable = pizza_maker

Don't forget the members line takes the name of your public key file without the .pub at the end. If your key was named XYZ.pub then your members line would have XYZ here.
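
If you have several keys to add, the member name can be derived from the file name rather than typed by hand. A small sketch (member_name is a made-up helper, not part of gitosis):

```shell
# Strip the directory and the .pub suffix from a key file path.
member_name() {
    basename "$1" .pub
}

member_name keydir/jgoulah.pub   # prints jgoulah
```

Whatever it prints is what belongs on the members line in gitosis.conf.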

git commit -a -m "Create new repo permissions for pizza_maker project"
git push

As a reminder, the second part of this series will show an svn to git import. For now let's assume we are starting from scratch. We'd create our project like this

cd && mkdir pizza_maker
cd pizza_maker
git init
git remote add origin git@YOUR_SERVER_HOSTNAME:pizza_maker.git
git add *
git commit -m "some stuff"
git push origin master:refs/heads/master

The only other thing to know is how to grant another user access to your repository. All you have to do is add their public key to the keydir folder, give the user permissions by modifying gitosis.conf, and then git add the new key file, commit, and push the gitosis-admin repository so the change takes effect.

cd gitosis-admin
cp ~/otherdude.pub keydir/

and in gitosis.conf:

 [group myteam]
- members = jgoulah
+ members = jgoulah otherdude
  writable = pizza_maker

If you need to, you can also grant public access over the git:// protocol like so

sudo -u git git-daemon --base-path=/home/git/repositories/ --export-all

Then someone can clone like

git clone git://YOUR_SERVER_HOSTNAME/pizza_maker.git

Conclusion

This article showed how to set up gitosis and how to initialize your gitosis-admin repository. Using a repository to manage repositories is a unique concept in itself, and it works rather well. We also went over how to create our own new git repository, and how to manage the access permissions through gitosis.conf. Part two of this series will explain how to port from your current SVN setup to a Git setup. This article is a prerequisite if you want to host your own private repository when you're converting from SVN to Git, and that's what we'll look at next time.



Jan 11 2009

Intelligent Version Control and Branching

Category: Version Control | jgoulah @ 12:28 PM

Overview

Most shops these days seem to know that using a version control system is necessary for organizing a large software project involving multiple developers. It’s essential to allow each developer to work on their part of the project and commit the changes to a central repository, which the rest of the developers can then pull down into their working sandbox. It’s an effective way to develop that avoids overwriting changes and allows for easy rollback and history. However, one problem I see over and over again is either a lack of branching altogether, or branching done in a manual way that lets the branch drift further out of sync with trunk as time goes on, followed by an attempt to manually merge it with the trunk code when the branch is done. For those who don’t know, trunk is basically your mainline, from which code can be branched off for features that won’t interfere with trunk until they are merged at some later point. This allows for testing changes without bothering other developers, while still keeping a commit history.

The difficult thing about this branching process is that if you try to manually merge code that has been on a branch for some time, the continuous development on trunk will have brought the two to such different places that you end up trying to resolve a ton of conflicts by hand.  Conflicts happen when changes are made to the same file and same lines in that file by different people.  Trunk is eventually a different bunch of code than when the branch was started, so by the end of your branch you could be developing against what is essentially a different codebase.  Anyone that has tried to do merging manually knows how painful this process can be.

Luckily, tools exist to help us with the process. This article is going to focus on two tools that can interact with an SVN repository to help us with intelligent branching. The tools I’m going to review are SVK and Git, and I will go over each separately. In fact they should not be used simultaneously, since Git doesn’t know anything about the metadata SVK expects to be in your SVN props. In other words, do not attempt to create a branch with SVK and then merge with Git; it just won’t work. Git, by the way, can be used standalone, without SVN, but I’m going to show you how to use it with SVN because that seems to be the version control system of choice in most shops these days. SVK is just a wrapper around SVN; it’s not a standalone tool. The great thing about both of these is you don’t necessarily have to migrate all of your developers over at once. They can continue their day-to-day workflow while you harness the power of branching. Eventually people will see the advantage and try it out too. Under the covers you are still storing your code in an SVN repository.

Using SVK

Installing SVK from source is well beyond the scope of this article, so I’m assuming you have a package management system that will handle this installation for you. Once you have SVK installed, using it is very straightforward for SVN users. In fact, it’s mostly the exact same commands, with a few additions. Like SVN, you can do ‘svk help <command>’ to get detailed help on a particular command.

Checking Out Your Repository

Checking out your repository is very similar to SVN, and is done with the ‘checkout’ command:

svk co {repo}

where {repo} is your repository url (eg. http://svn.yourdomain.com/svn/repo)

You’ll be prompted for a base URI to mirror from and it’s ok to press enter here:

New URI encountered: http://svn.yourdomain.com/svn/repo
Choose a base URI to mirror from (press enter to use the full URI): ENTER

You’ll be prompted for a depot path and it’s also ok to accept the default:

Depot path: [//mirror/repo] ENTER

SVK is a decentralized version control system and thus needs to mirror your SVN repository locally. You’ll see a prompt like this and you’ll be ok to accept the default:

Synchronizing the mirror for the first time:
a : Retrieve all revisions (default)
h : Only the most recent revision
-count : At most 'count' recent revisions
revision : Start from the specified revision
a)ll, h)ead, -count, revision? [a] ENTER

This step can take quite a while, depending on how large your repository is. Just be patient and it will finish up eventually.

Working with SVK

From here SVK is very similar to SVN. You’ll have checked out the entire repository, so if your SVN is set up according to the standard layout you will see a trunk, branches, and tags directory under your repository root folder.

SVK has the same command set with one caveat, that when you update you have to give the -s flag:

svk up -s

This allows SVK to sync the mirror with the repository before updating your code.

Branching With SVK

The point of this article is how to use these tools for branching, so let’s create a branch. There are a few ways to do this, but a nice way is to copy trunk to a branch on the mirror itself:

svk cp //mirror/{repo}/trunk //mirror/{repo}/branches/my_branch_name
cd {repo}/branches
svk up -s # pull the new branch back from the mirror

Now you can work, work, work, and commit as you please. Here is where SVK really shines. You can pull from the repository path you branched from at any time. In this case we branched from trunk, so we can pull trunk’s changes into our branch.

svk pull

SVK will track all of the metadata surrounding the change, and if there happens to be a conflict it is resolved at pull time. Now the trunk code is merged into your branch. You are in sync with trunk, in addition to your branch’s changes.

The last thing to know is how to push your branch back to trunk when you are done. It doesn’t hurt to do one last pull to make sure you are in sync with trunk, and then:

svk push -l

Now your branch has been merged back to trunk. That is it! Painless and easy. You’ve already resolved the conflicts at pull time which SVK has tracked, so the push is simple. SVK has done all the hard work for us.

It’s a good habit to get rid of your branch once you are done:

rm -rf branches/branch_name
svk rm branches/branch_name
svk commit -m "deleted old branch" branches/branch_name

Using Git

Using Git with SVN is a similar process, and like SVK, Git is also a distributed system. In a nutshell this means we can have local branches and remote branches as well. It’s important to understand this concept when using Git, much more so than when using SVK. It’s probably a good idea to go over the SVN to Git tutorial to see how familiar SVN commands map to Git.

Checking Out Your Repository

With Git you essentially clone your SVN repository. Here I’m using a repository called testrepo located on the same host so I am using the file protocol. You can use whichever protocol you normally use to access your SVN repository.

git svn clone --prefix=svn/ file:///usr/local/svn/testrepo -T trunk -b branches -t tags

You don’t have to use the --prefix here, but it lets you easily distinguish between local branches and those representing remote svn branches, so I highly recommend it.

Viewing Local and Remote Branches

With Git it’s important to remember that we have both local and remote branches. Local branches only exist for our sandbox, while remote branches are shared out on the SVN repository for others to use.

You can view your local branches like so:

$ git branch
* master

So far we only have the default master branch. When git svn is finished importing all of the svn history it sets up a master branch for you to work with. The tip of that branch will be the latest commit in the svn repository (note this is not necessarily trunk). The star indicates the current branch that we are on.

It’s not a bad idea to have master correspond to trunk, and to be sure we can run:

git reset --hard svn/trunk

You can also view your remote branches. These exist out on the SVN repository we’ve cloned:

$ git branch -r
svn/testbug
svn/trunk

So far I can see that a trunk exists, as well as a branch called testbug. But these are remote, so to work with them we need to pull them down locally. Right now I’m interested in trunk, so we’ll grab that:

$ git checkout -b trunk svn/trunk
Switched to a new branch "trunk"

Anytime you would like to update the code with changes from the repository, you can do it like so:

git svn rebase

Branching With Git

Now let’s assume we want to create a new branch called foo. This command will actually create the new branch on the SVN repository.

$ git svn branch foo

Since the branch is still only remote, we need to get a local copy:

$ git checkout -b foo svn/foo

Now let’s actually do something on the branch, so we can show how merge operations work. We’ll just create a simple file:

$ touch myfile
$ echo "stuff" > myfile
$ cat myfile

And commit that to the local repository:

$ git add myfile
$ git commit -m "my new file"

Note this only commits to the local repository, and other users on the branch won’t see the change until we commit it to the remote repository. It’s a good idea to do a dry run to make sure we are committing to the right place. We can use ‘-n’ for a dry run:

$ git svn dcommit -n
Committing to file:///usr/local/svn/testrepo/branches/foo ...

This looks correct. It’s going to commit to branches/foo on our remote repository, so run the command without the -n option.

$ git svn dcommit

Now let’s switch back to trunk so we can perform a merge.

$ git checkout trunk

Now that we are on trunk, our new file doesn’t exist, which makes sense since it was created on the branch. But we want to merge our branch foo back into trunk, which can be done like so:

$ git merge --no-ff foo

And again we want to commit that local change to the remote repository:

$ git svn dcommit

There’s one more important thing to note. Normally you don’t need --no-ff unless you are using git against svn like this. --no-ff is needed so you get a real merge commit. Without it, if a fast-forward is possible, you’ll end up with the commit from the svn branch on top of the history, and git svn dcommit will then commit to the SVN branch that commit belongs to instead of to trunk.

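Therefore it’s a good idea to make this option the default for any branch you merge into that is maintained with git-svn. The branch.<name>.mergeoptions setting applies to merges into <name>, so for the merge above it belongs on trunk. You can set it in .git/config with:

git config branch.trunk.mergeoptions "--no-ff"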

This allows you to use git merge without having to remember the --no-ff option.

Conclusion

We’ve seen how to use two distributed revision control systems to help us manage our remote SVN repositories. We know how to create a branch, and stay in sync with the development on trunk. We have also learned how to merge our changes from the branch back into trunk in a very painless and easy way.
