Jan 09 2012

Distributed MySQL Sleuthing on the Wire

Category: Databases,Real-time Web,SSH,Systemsjgoulah @ 8:52 AM

Intro

Oftentimes you need to know what MySQL is doing right now and furthermore if you are handling heavy traffic you probably have multiple instances of it running across many nodes. I’m going to start by showing how to take a tcpdump capture on one node, a few ways to analyze that, and then go into how to take a distributed capture across many nodes for aggregate analysis.

Taking the Capture

The first thing you need to do is to take a capture of the interesting packets. You can either do this on the MySQL server or on the hosts talking to it. According to this percona post this command is the best way to capture mysql traffic on the eth0 interface and write it into mycapture.cap for later analysis:

% tcpdump -i eth0 -w mycapture.cap -s 0 "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
47542 packets captured
47703 packets received by filter
60 packets dropped by kernel

Analyzing the Capture

The next step is to take a look at your captured data. One way to do this is with tshark, which is the command line part of wireshark. You can do yum install wireshark or similar to install it. Usually you want to do this on a different host than the one taking traffic since it can be memory and CPU intensive.

You can then use it to reconstruct the mysql packets like so:

% tshark -d tcp.port==3306,mysql -T fields -R mysql.query -e frame.time -e ip.src -e ip.dst -e mysql.query -r mycapture.cap

This will give you the time, source IP, destination IP, and query but this is still really raw output. Its a nice start but we can do better. Percona has released the Percona Toolkit which includes some really nice command line tools (including what used to be in Maatkit).

The one we’re interested in here is pt-query-digest

It has tons of options and you should read the documentation, but here’s a few I’ve used recently.

Lets say you want to get the top tables queried from your tcpdump

% tcpdump -r mycapture.cap -n -x -q -tttt | pt-query-digest --type tcpdump --group-by tables --order-by Query_time:cnt \
 --report-format profile --limit 5
reading from file mycapture.cap, link-type EN10MB (Ethernet)

# Profile
# Rank Query ID Response time Calls R/Call Apdx V/M   Item
# ==== ======== ============= ===== ====== ==== ===== ====================
#    1 0x        0.3140  6.1%   674 0.0005 1.00  0.00 shard.images
#    2 0x        0.8840 17.1%   499 0.0018 1.00  0.03 shard.activity
#    3 0x        0.1575  3.1%   266 0.0006 1.00  0.00 shard.listing_images
#    4 0x        0.1680  3.3%   265 0.0006 1.00  0.00 shard.connection_edges_reverse
#    5 0x        0.0598  1.2%   254 0.0002 1.00  0.00 shard.listing_translations
# MISC 0xMISC    3.5771 69.3%  3534 0.0010   NS   0.0 <86 ITEMS>

Note the tcpdump options I used this time, which the tool requires to work properly when passing –type tcpdump. I also grouped by tables (as opposed to full queries) and ordered by the count (the Calls column). It will stop at your –limit and group the rest into MISC so be aware of that.

You can remove the –order-by to sort by response time, which is the default sort order, or provide other attributes to sort on. We can also change the –report-format, for example to header:

% tcpdump -r mycapture.cap -n -x -q -tttt | pt-query-digest --type tcpdump --group-by tables --report-format header
reading from file mycapture.cap, link-type EN10MB (Ethernet)

# Overall: 5.49k total, 91 unique, 321.13 QPS, 0.30x concurrency _________
# Time range: 2012-01-08 15:52:05.814608 to 15:52:22.916873
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time             5s     3us   114ms   939us     2ms     3ms   348us
# Rows affecte         316       0      13    0.06    0.99    0.29       0
# Query size         3.64M      18   5.65k  694.98   1.09k  386.68  592.07
# Warning coun           0       0       0       0       0       0       0
# Boolean:
# No index use   0% yes,  99% no

If you set the –report-format to query_report you will get gobs of verbose information that you can dive into and you can use the –filter option to do things like getting slow queries:

% tcpdump -r mycapture.cap -n -x -q -tttt | \
  pt-query-digest --type tcpdump --filter '($event->{No_index_used} eq "Yes" || $event->{No_good_index_used} eq "Yes")'

Distributed Capture

Now that we’ve taken a look at capturing and analyzing packets from one host, its time to dive into looking at our results across the cluster. The main trick is that tcpdump provides no option to stop capturing – you have to explicitly kill it. Otherwise we’ll just use dsh to send our commands out. We’ll assume you have a user that can hop around in a password-less fashion using ssh keys – setting that up is well outside the scope of this article but there’s plenty of info out there on how to do that.

There’s a few ways you can let a process run on a “timeout” but I’m assuming we don’t have any script written or tools like bash timeout or the one distributed in coreutils available.

So we’re going off the premise that you will background the process and kill it after a sleep by grabbing its pid:

( /path/to/command with options ) & sleep 5 ; kill $!

Simple enough, except we’ll want to capture the output on each host, so we need to ssh the output back over to the target using a pipe to grab the stdout. This means that $! will return the pid of our ssh command instead of our tcpdump command. We end up having to do a little trick to kill the right process, since the capture won’t be readable if we kill ssh command that is writing the output. We’ll need to kill tcpdump and to do that we can look at the parent pid of the ssh process, ask pkill (similar to pgrep) for all of the processes that have this parent, and finally kill the oldest one, which ends up being our tcpdump process.

Then end result looks like this if I were to run it across two machines:

% dsh -c -m web1000,web1001 \
   'sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | \
   ssh dshhost "cat - > ~/captures/$(hostname -a).cap" & sleep 10 ; \
   sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")'

So this issues a dsh to two of our hosts (you can make a dsh group with 100 or 1000 hosts though) and runs the command concurrently on each (-c). We issue our tcpdump on each target machine and send the output to stdout for ssh to then cat back to a directory on the source machine that issued the dsh. This way we have all of our captures in one directory with each file named with the target name of each host the tcpdump was run. The sleep is how long the dump is going to run for before we then kill off the tcpdump.

The last piece of the puzzle is to get these all into one file and we can use the mergecap tool for this, which is also part of wireshark:

% /usr/sbin/mergecap -F libpcap -w output.cap *.cap

And then we can analyze it like we did above.

Further Reading

References

http://www.mysqlperformanceblog.com/2011/04/18/how-to-use-tcpdump-on-very-busy-hosts

http://stackoverflow.com/questions/687948/timeout-a-command-in-bash-without-unnecessary-delay

http://www.xaprb.com/blog/2009/08/18/how-to-find-un-indexed-queries-in-mysql-without-using-the-log/

Breaking the distributed command down further

Just to clarify this command a bit more, particularly how the kill part works since that was the trickiest part for me to figure out.

When we run this

$ dsh -c -m web1000,web1001 \
   'sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | \
   ssh dshhost "cat - > ~/captures/$(hostname -a).cap" & sleep 10 ; \
   sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")'

on the server the process list looks something like

user     12505 12504  0 03:12 ?        00:00:00 bash -c sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" | ssh myhost.myserver.com "cat - > /home/etsy/captures/$(hostname -a).cap" & sleep 5 ; sudo pkill -o -P $(ps -ef | awk "\$2 ~ /\<$!\>/ { print \$3; }")
pcap     12506 12505  1 03:12 ?        00:00:00 /usr/sbin/tcpdump -i eth0 -w - -s 0 -x -n -q -tttt port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2
user     12507 12505  0 03:12 ?        00:00:00 ssh myhost.myserver.com cat - > ~/captures/web1001.cap

So $! is going to return the pid of the ssh process, 12507. We use awk to find the process matching that, and then print the parent pid out, which is then passed to the -P arg of pkill. If you use pgrep to look at this without the -o you’d get a list of the children of 12505, which are 12506 and 12507. The oldest child is the tcpdump command and so adding -o kills that guy off.

So if we were only running the command on one host we could use something much simpler

ssh dbhost01 '(sudo /usr/sbin/tcpdump -i eth0 -w - -s 0 port 3306) & sleep 10; sudo kill $!' | cat - > output.cap

Tags: , , , , ,


Oct 02 2010

Quick Tip: Dynamically Updating Screen Window Titles With The Current Server Hostname

Category: Organization,SSH,Systemsjgoulah @ 12:13 PM

I haven’t had a ton of time for blogging lately but figured this tip was good enough to throw out there for all the screen users. One way I like to organize servers that I’m ssh’d into is using screen windows. As you hopefully know you can use Ctrl-A c to create sub windows within screen. Then you can switch between them in several ways such as using Ctrl-A X where X is the window number, Ctrl-A n or Ctrl-A p for next and previous, and Ctrl-A “ to get a list of the windows for selection. You can move windows around with Ctrl-A : then type number X where X is the window you want to swap with. Finally, you can also name the windows with Ctrl-A A. So usually I ssh into a server, and manually change the window title to the server name I’m ssh’d into so that its easy to organize and remember where my windows are.

Turns out thats a lot of repetitive work that can easily be scripted. You can create a really simple script called ssh-to and place in your ~/bin or somewhere in your path

#!/bin/bash

hostname=`basename $0`

# set screen title to short hostname
echo -n -e "\033k${hostname%%.*}\033\134"

# ssh to the server with any additional args
ssh -X $hostname $*

# set hostname back when session is done (-s on osx)
echo -n -e "\033k`hostname -s`\033\134"

Now, create symbolic links to the script with each server hostname that you use. This can be a little tedious but you only have to do it once and the benefit is that you can now tab complete ssh’ing into servers.

For example you would do something like

ln -s ssh-to myserver.com

Then anytime you type “myserver.com” the tab completion can fill it out, you’re ssh’d into the server (assuming you have keys setup you don’t have to type a password) and your screen window title is updated with the domain info trimmed off, in this case myserver. Now your screen windows update their own titles anytime you ssh into a new server!

Tags: , , ,


Jan 18 2010

Using Mongo and Map Reduce on Apache Access Logs

Category: Databases,Systemsjgoulah @ 9:32 PM

Introduction

With more and more traffic pouring into websites, it has become necessary to come up with creative ways to parse and analyze large data sets. One of the popular ways to do that lately is using MapReduce which is a framework used across distributed systems to help make sense of large data sets. There are lots of implementations of the map/reduce framework but an easy way to get started is by using MongoDB. MongoDB is a scalable and high performant document oriented database. It has replication, sharding, and mapreduce all built in which makes it easy to scale horizontally.

For this article we’ll look a common use case of map/reduce which is to help analyze your apache logs. Since there is no set schema in a document oriented database, its a good fit for log files since its fairly easy to import arbitrary data. We’ll look at getting the data into a format that mongo can import, and writing a map/reduce algorithm to make sense of some of the data.

Getting Setup

Installing Mongo

Mongo is easy to install with detailed documentation here. In a nutshell you can do

$ mkdir -p /data/db
$ curl -O http://downloads.mongodb.org/osx/mongodb-osx-i386-latest.tgz
$ tar xzf mongodb-osx-i386-latest.tgz

At this point its not a bad idea to put this directory somewhere like /opt and adding its bin directory to your path. That way instead of

 ./mongodb-xxxxxxx/bin/mongod &

You can just do

mongodb &

In any case, start up the daemon one of those two ways depending how you set it up.

Importing the Log Files

Apache access files can vary in the information reported. The log format is easy to change with the LogFormat directive which is documented here. In any case these logs that I’m working with are not out of the box apache format. They look something like this

Jan 18 17:20:26 web4 logger: networks-www-v2 84.58.8.36 [18/Jan/2010:17:20:26 -0500] "GET /javascript/2010-01-07-15-46-41/97fec578b695157cbccf12bfd647dcfa.js HTTP/1.1" 200 33445 "http://www.channelfrederator.com/hangover/episode/HNG_20100101/cuddlesticks-cartoon-hangover-4" "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7" www.channelfrederator.com 35119

We want to take this raw log and convert it to a JSON structure for importing into mongo, which I wrote a simple perl script to iterate through the log and parse it into sensible fields using a regular expression

#!/usr/bin/perl

use strict;
use warnings;

my $logfile = "/logs/httpd/remote_www_access_log";
open(LOGFH, "$logfile");
foreach my $logentry (<LOGFH>) {
    chomp($logentry);   # remove the newline
    $logentry =~ m/(\w+\s\w+\s\w+:\w+:\w+)\s #date
                   (\w+)\slogger:\s # server host
                   ([^\s]+)\s # vhost logger
                   (?:unknown,\s)?(-|(?:\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3},?\s?)*)\s #ip
                   \[(.*)\]\s # date again
                   \"(.*?)\"\s # request
                   (\d+)\s #status
                   ([\d-]+)\s # bytes sent
                   \"(.*?)\"\s # referrer
                   \"(.*?)\"\s # user agent
                   ([\w.-]+)\s? # domain name
                   (\d+)? # time to server (ms)
                  /x;

    print <<JSON;
     {"date": "$5", "webserver": "$2", "logger": "$3", "ip": "$4", "request": "$6", "status": "$7", "bytes_sent": "$8", "referrer": "$9", "user_agent": "$10", "domain_name": "$11", "time_to_serve": "$12"}
JSON

}

Again my regular expression probably won’t quite work on your logs, though you may be able to take bits and pieces of what I’ve documented above to use for your script. On my logs that script outputs a bunch of lines that look like this

{"date": "18/Jan/2010:17:20:26 -0500", "webserver": "web4", "logger": "networks-www-v2", "ip": "84.58.8.36", "request": "GET /javascript/2010-01-07-15-46-41/97fec578b695157cbccf12bfd647dcfa.js HTTP/1.1", "status": "200", "bytes_sent": "33445", "referrer": "http://www.channelfrederator.com/hangover/episode/HNG_20100101/cuddlesticks-cartoon-hangover-4", "user_agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7", "domain_name": "www.channelfrederator.com", "time_to_serve": "35119"}

And we can then import that directly to MongoDB. The creates a collection called weblogs in the logs database, from the file that has the output of the above JSON generator script

$ mongoimport --type json -d logs -c weblogs --file weblogs.json

We can also take a look at them and verify they loaded by running the find command, which dumps out 10 rows by default

$ mongo
> use logs;
switched to db logs
> db.weblogs.find()

Setting Up the Map and Reduce Functions

So for this example what I am looking for is how many hits are going to each domain. My servers handle a bunch of different domain names and that is one thing I’m outputting into my logs that is easy to examine

The basic command line interface to MondoDB is a kind of javascript interpreter, and the mongo backend takes javascript implementations of the map and reduce functions. We can type these directly into the mongo console. The map function must emit a key/value pair and in this example it will output a ’1′ each time a domain is found

> map = "function() { emit(this.domain_name, {count: 1}); }"

And so basically what comes out of this is a key for each domain with a set of counts, something like

 {"www.something.com", [{count: 1}, {count: 1}, {count: 1}, {count: 1}]}

This is send to the reduce function, which sums up all those counts for each domain

> reduce = "function(key, values) { var sum = 0; values.forEach(function(f) { sum += f.count; }); return {count: sum}; };"

Now that you’ve setup map and reduce functions, you can call mapreduce on the collection

> results = db.weblogs.mapReduce(map, reduce)
...
>results
{
        "result" : "tmp.mr.mapreduce_1263861252_3",
        "timeMillis" : 9034,
        "counts" : {
                "input" : 1159355,
                "emit" : 1159355,
                "output" : 92
        },
        "ok" : 1,
}

This gives us a bit of information about the map/reduce operation itself. We see that we inputted a set of data and emmitted once per item in that set. And we reduced down to 92 domain with a count for each. A result collection is given, and we can print it out

> db.tmp.mr.mapreduce_1263861252_3.find()
> it

You can type the ‘it’ operator to page through multiple pages of data. Or you can print it all at once like so

> db.tmp.mr.mapreduce_1263861252_3.find().forEach( function(x) { print(tojson(x));});
{ "_id" : "barelydigital.com", "value" : { "count" : 342888 } }
{ "_id" : "barelypolitical.com", "value" : { "count" : 875217 } }
{ "_id" : "www.fastlanedaily.com", "value" : { "count" : 998360 } }
{ "_id" : "www.threadbanger.com", "value" : { "count" : 331937 } }

Nice, we have our data aggregated and the answer to our initial problem.

Automate With a Perl Script

As usual CPAN comes to the rescue with a MongoDB driver that we can use to interface with our database in a scripted fashion. The mongo guys have also done a great job of supporting it in a bunch of other languages which makes it easy to interface with if you are using more than just perl

#!/bin/perl

use MongoDB;
use Data::Dumper;
use strict;
use warnings;

my $conn = MongoDB::Connection->new("host" => "localhost", "port" => 27017);
my $db = $conn->get_database("logs");

my $map = "function() { emit(this.domain_name, {count: 1}); }";

my $reduce = "function(key, values) { var sum = 0; values.forEach(function(f) { sum += f.count; }); return {count: sum}; }";

my $idx = Tie::IxHash->new(mapreduce => 'weblogs', 'map' => $map, reduce => $reduce);
my $result = $db->run_command($idx);

print Dumper($result);

my $res_coll = $result->{'result'};

print "result collection is $res_coll\n";

my $collection = $db->get_collection($res_coll);
my $cursor = $collection->query({ }, { limit => 10 });

while (my $object = $cursor->next) {
    print Dumper($object);
}

The script is straightforward. Its just using the documented perl interface to make a run_command call to mongo, which passes the map and reduce javascript functions in. It then prints the results similar to how we did on the command line earlier.

Summary

We’ve gone over a very simple example of how MapReduce can work for you. There are lots of other ways you can put it to good use such as distributed sorting and searching, document clustering, and machine learning. We also took a look at MongoDB which has great uses for schema-less data. It makes it easy to scale since it has built in replication and sharding capabilities. Now you can put map/reduce to work on your logs and find a ton of information you couldn’t easily get before.

References

http://github.com/mongodb/mongo
http://www.mongodb.org/display/DOCS/MapReduce
http://apirate.posterous.com/visualizing-log-files-with-mongodb-mapreduce
http://search.cpan.org/~kristina/MongoDB-0.27/
http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/

Tags: , , , , , , , , ,


Jun 11 2009

Investigating Data in Memcached

Category: Databases,Systemsjgoulah @ 9:58 PM

Intro

Almost every company I can think of uses Memcached at some layer of their stack, however I haven’t until recently found a great way to take a snapshot of the keys in memory and their associated metadata such as expire time, LRU time, the value size and whether its been expired or flushed. The tool to do this is called peep

Installing Peep

The main thing about installing peep is that you have to compile memcached with debugging symbols. Not really a big deal though. First thing you’ll want to do is install libevent

$ wget http://monkey.org/~provos/libevent-1.4.11-stable.tar.gz
$ tar xzvf libevent-1.4.11-stable.tar.gz
$ cd libevent-1.4.11-stable
$ ./configure
$ make
$ sudo make install

Now grab memcached

$ wget http://memcached.googlecode.com/files/memcached-1.2.8.tar.gz
$ tar xzvf memcached-1.2.8.tar.gz
$ cd memcached-1.2.8
$ CFLAGS='-g' ./configure --enable-threads
$ make
$ sudo make install

Note the configure line on this one sets the -g flag which enables debug symbols. Now move this directory somewhere standard like /usr/local/src because we’ll need to reference it when installing peep

$ sudo mv memcached-1.2.8 /usr/local/src

Ok, now for peep, which is written in ruby. If you don’t have rubygems you need that and the ruby headers. Just use your package management system for this. I happened to be on a CentOS box so I ran

$ sudo yum install rubygems.noarch
$ sudo yum install ruby-devel.i386

Finally you can install peep

$ sudo gem install peep -- --with-memcached-include=/usr/local/bin/memcached-1.2.8

Using Peep

Finally you can actually use this thing, you just give it the running memcached process pid

$ sudo peep --pretty `cat /var/run/memcached/memcached.pid`

Here’s a snippet of what peep can show us in its “pretty” mode

      time |   exptime |  nbytes | nsuffix | it_f | clsid | nkey |                           key | exprd | flushd
       485 |       785 |   17925 |      10 | link |    25 |   64 |  "post.ordered-posts1-03afdb" |  true |  false
       537 |       721 |   16991 |      10 | link |    24 |   63 |  "post.ordered-posts1-03bd6"  |  true |  false
       240 |      3684 |     434 |       8 | link |     9 |   22 |  "channel.channel.105.v1"     | false |  false
       241 |      3687 |   27286 |      10 | link |    27 |   55 |  "post.post_count-fec35129"   | false |  false
       538 |      4022 |    3223 |       9 | link |    17 |   55 |  "post.post_count-2ff57a7"    | false |  false
       538 |      4024 |   17169 |      10 | link |    25 |   55 |  "post.post_count-2ba928998d" | false |  false
       241 |      3686 |   10763 |      10 | link |    22 |   55 |  "post.post_count-3879a24011" | false |  false
        25 |       320 |    8874 |       9 | link |    22 |   22 |  "channel.posterframes.4"     |  true |  false

Putting the Data Into MySQL

The above is not the easiest thing to analyze, especially if you have a lot of data in cache. But we can easily load it in to MySQL so that we can run queries on it.

First create the db and permissions

mysql> create database peep;
mysql> grant all on peep.* to peep@'localhost' identified by 'peep4u';

Then a table to store the output of the peep snapshot

mysql> CREATE TABLE `entries` (
  `lru_time` int(11) DEFAULT NULL,
  `expires_at_time` int(11) DEFAULT NULL,
  `value_size` int(11) DEFAULT NULL,
  `suffix_size` int(11) DEFAULT NULL,
  `it_flag` varchar(255) DEFAULT NULL,
  `class_id` int(11) DEFAULT NULL,
  `key_size` int(11) DEFAULT NULL,
  `key_name` varchar(255) DEFAULT NULL,
  `is_expired` varchar(255) DEFAULT NULL,
  `is_flushed` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Now run peep and output the data in “ugly” format into a file

$ sudo peep --ugly `cat /var/run/memcached/memcached.pid` > peep.out

And now you can load it into your entries table

mysql> load data local infile '/tmp/peep.out' into table entries fields terminated by ' | ' lines terminated by '\n';
Query OK, 177 rows affected (0.01 sec)
Records: 177  Deleted: 0  Skipped: 0  Warnings: 0

I’m only loading a small dev install of memcahed but if you are running this on production you’d be importing many more rows. Luckily load data infile is sufficiently optimized to input large datasets. Somewhat unfortunately however, peep makes memcached block while its taking its snapshot, which can take a bit of time in production.

In any case, not that interesting with dev data but you can get lots of interesting numbers out of this. In my case it looks like most of the stuff in my cache is expired already.

mysql> select is_expired, count(*) as num, sum(value_size) as value_size from entries group by is_expired;
+------------+-----+------------+
| is_expired | num | value_size |
+------------+-----+------------+
| false      |   1 |        415 |
| true       | 176 |    2392792 |
+------------+-----+------------+
2 rows in set (0.01 sec)

Or selecting by grouping slabs and displaying their sizes

mysql> select class_id as slab_class, max(value_size) as slab_size from entries group by slab_class;
+------------+-----------+
| slab_class | slab_size |
+------------+-----------+
|          1 |         8 |
|          2 |        15 |
|          9 |       445 |
|         12 |      1100 |
|         13 |      1402 |
|         14 |      1710 |
|         15 |      2174 |
|         16 |      2409 |
|         17 |      3223 |
|         20 |      6154 |
|         21 |      8395 |
|         22 |     10763 |
|         23 |     13395 |
|         24 |     16991 |
|         25 |     17925 |
|         26 |     23320 |
|         27 |     33361 |
|         28 |     38824 |
|         29 |     44904 |
+------------+-----------+
19 rows in set (0.00 sec)

Conclusion

Although there are a lot of tools such as Cacti that will display graphs of your Memcached usage, there are not many tools that will actually show you specifics of whats in memory. This tool can be used for a variety of reasons, but its especially useful to example the keys that you have in memory and the size of their values, as well as if they’re expired and what the individuals slabs are holding.

Tags: , , , ,


Feb 22 2009

Attach Progress Bars to Everyday Linux Commands With Pipe Viewer

Category: Systemsjgoulah @ 9:45 PM

Introduction

A lot of Linux command line utilities don’t give any indication of progress in terms of when the job is going to finish. I’m going to take a couple of commands that I use on a near daily basis and show how to output some progress using the pipe viewer utility. This is a handy tool that you can insert between pipes in your shell commands that will give you a visual indication of how long its going to take for the command to finish.

Getting Started

There isn’t much to installing this tool with your favorite package manager, so just make sure you do that first using yum or apt.

The easiest way to show what pv does is by creating a simple example. We can simply gzip a file and show how much data is being processed and how long it will take to finish.

$ pv somefile | gzip > somefile.log.gz
64.1MB 0:00:03 [10.1MB/s] [==========>                        ] 32% ETA 0:00:06

So here can see we have processed 64.1MB in 3 seconds at a rate of about 10.1MB/s and should take about 6 seconds to finish.

Chaining PV to Display Different Inputs and Outputs

Another really cool thing is you can chain these pv commands to see the progress of data being read off the disk and then how fast gzip is compressing it.

$ pv -cN source_stream somefile | gzip | pv -cN gzip_stream > somefile.log.gz
source_stream: 89.6MB 0:00:06 [10.1MB/s] [============>       ] 45% ETA 0:00:07
gzip_stream:   14.8MB 0:00:06 [ 3.2MB/s] [       <=>          ]

In this example we created named streams with -N which you can call whatever you want. In this example they are called source_stream and gzip_stream since that’s what they are measuring. The -c parameter is recommended when running multiple pv processes in the same pipeline and prevents the processes from mangling each others output.

Its important also to point out here that we get a percentage on the source_stream because pv knows how much data exists in that file, but the second stream doesn’t know how much data will pass through gzip and so it can only display how much data has passed through so far.

Using PV on Other Commands

Another example is using tar to compress a folder, but if we do something like this:

$ tar cjf - some_directory | pv > out.tar.bz2
3.55MB 0:00:02 [ 1.2MB/s] [   <=>          ]

the output isn’t very interesting because pv doesn’t know the total amount of data its compressing. We can provide the size with the -s option to pv, and a little bash interpretation to grab the size

tar -cf - some_directory | \
     pv -s $(du -sb some_directory | awk '{print $1}') | \
     bzip2 > somedir.tar.bz2
1.39MB 0:00:03 [ 929kB/s] [==>     ]  6% ETA 0:00:45

So we are telling tar to create an archive of some_directory, print the output to stdout, next tell pv the size of that directory using the du command on it, and send that to bzip2 for compression.

Create a Shell Script For Easy Reuse

At this point the command above gets to be way too much to remember. We’ve also had to type the directory name twice which is silly. I always put my ~/bin directory in my PATH so that I can throw easy scripts in there. Here’s one that will take a directory argument and compress it with progress

#!/bin/sh

if [ -z $1 ]; then
    echo "Usage: $0 dir"
    exit 1
fi

dir="${1%/*}"

if [ ! -e $dir ]; then
  echo "$dir doesn't exist"
  exit 1
fi

echo "compressing $dir"
tar -cf - $dir | pv -s $(du -sb $dir | awk '{print $1}') | bzip2 > $dir.tar.bz2

So we’ve just taken what we’ve learned above and created a script that takes an argument, which needs to be a directory (or file) that exists, and the output is that directory name with a tar.bz2 extension.

Seeing Progress In a MySQL Import

We can use the same exact concept to chain our tools to get a progress bar on a MySQL data import. Here it is

pv myschema.sql | \
      pv -s $(du -sb myschema.sql | awk '{print $1}') | \
      mysql -u db_user my_database

As shown above we can easily turn this into a script that can be reused.

Conclusion

We’ve seen how to take a simple tool called pipe viewer which gives information on the data fed to it and attached it to some commands we currently use. Since the commands can be tricky to remember we’ve seen how to wrap them in a simple shell script. Now you can experiment with using it on other commands that you frequently use and want to see progress output, which can be a really useful thing when dealing with large amounts of data.

Tags: ,




  • new england patriots 98.5
  • hp support 530
  • zara phillips kids
  • la ink ink
  • tubing
  • new england patriots 1996 roster
  • connecticut 30 news
  • princes
  • neptune
  • bea 460 bosch
  • juliana
  • mtv rivals
  • bea 00037
  • baer
  • hong
  • search lsu.edu
  • transfers
  • cspan journal
  • dist 95
  • getaway
  • chad ochocinco age
  • bea karp
  • zara phillips husband
  • randy moss jail
  • connecticut education
  • randy moss legal issues
  • search engines for jobs
  • waffle
  • fundamentals
  • zara phillips wedding date
  • chicago bears 17 lisa lampanelli
  • bea goldfishberg
  • search engines no follow
  • zara phillips wedding hat
  • breaks
  • vince young yahoo stats
  • chicago bears tickets
  • search protocol host
  • joshua
  • xanadu bengals
  • connecticut football
  • plated
  • rambler
  • search 2.0
  • bea 71 series staples
  • chad ochocinco ultimate catch cast
  • tea party birthday
  • randy moss college
  • mtv executivesmtv fantasy factory
  • chicago bears zip hoodie
  • bengals for adoption
  • hp support greece
  • bengals tryouts
  • hp support englandhp support forum
  • bengals forum
  • lists
  • bengals arrests
  • mesa
  • dis windsor wi
  • randy moss future
  • bea fox
  • battleship yamato wreck
  • battleship aurora
  • chicago bears rumors 2011
  • hp support id
  • painted
  • chad ochocinco quotes video
  • vince young 6
  • chicago bears football club
  • vince young redskins
  • conditions
  • chad ochocinco nascar
  • battleship classes
  • frequency
  • havelock
  • new england patriots wiki
  • connecticut 104.1
  • dis systems
  • connecticut lakes
  • chicago bears garter
  • beagle
  • skate
  • trusted
  • chad ochocinco quickstep
  • tea party nj
  • dangerous
  • hp support helpline
  • cspan government shutdown
  • dually
  • c span 4 to 5
  • bend
  • mtv jams
  • search engines rankings 2011
  • zara phillips facebookzara phillips gossip
  • new england patriots kim kardashian
  • la ink book an appointment
  • battleship layout
  • tea party agenda
  • new england patriots jake locker
  • randy moss mix
  • search with image
  • units
  • la ink map
  • input
  • dist 91
  • $200
  • mtv 2 schedule
  • hp support englandhp support forum
  • new england patriots 65
  • searchbugsearch engines
  • la ink season 6
  • new england patriots needs
  • connecticut 97.7connecticut attorney general
  • chad ochocinco yesterday
  • randy moss vikings 2011
  • dis 2012 conference
  • sperry
  • lease
  • chad ochocinco 15
  • foster
  • cameras
  • chicago bears gifts
  • zara phillips baby
  • la ink youtube pixie
  • connecticut 5 star resorts
  • mtv 5 cover
  • connecticut 5th district
  • search comcast net
  • gelatin
  • zara phillips tongue
  • teck
  • la ink cast
  • resistivity
  • bangles eternal flame mp3bengals forum
  • buss
  • protector
  • la ink season 5
  • vince young 3rd 30
  • connecticut department of labor
  • new england patriots helmet
  • connecticut limo
  • discjuggler
  • chicago bears pictures
  • mtv oddities
  • hp support error 1005
  • bea taylor
  • search engines zuula
  • cspan facebook
  • mtv music awards
  • vince young injury
  • search engines 9
  • battleship hacked
  • gregg olsen books
  • bengals job fair
  • dis poem
  • bengals 09 record
  • search 4
  • timbaland
  • freida pinto can't act
  • hp support hard drive replacement
  • login
  • dis x
  • marathon
  • dis pater
  • search chuck norris
  • nubian
  • vince young uncle rico
  • connecticut transit
  • chicago bears 4th phase
  • bea spells a lot
  • tea party obama
  • di's hallmark
  • battleship aurora
  • albuterol
  • c span shelby foote
  • chad ochocinco to patriots
  • seized
  • search xml file
  • vince young football camp
  • vince young released
  • chad ochocinco wedding date
  • tea party lies
  • bea 2011 map
  • connecticut renaissance faire
  • zara phillips and the queen
  • zara phillips dating
  • fond
  • lavigne
  • ronda
  • chad ochocinco free agent
  • vince young rumors
  • cspan presidents
  • spares
  • search operatorssearch people