May 01 2010

Quick Tip: Gist From the Command Line

Category: Organizationjgoulah @ 3:56 PM

Many a time while working it is convenient to quickly show someone else code or a chunk of output from some command. The easiest way to do this is through a pastebin service. It’s also pretty much mandated on IRC to use a service like this and there are literally thousands out there to choose from.

The gist service on github is actually fairly convenient especially if you have a github account since it keeps an archive of all of your previous gists. It also does a good job formatting and lets you paste privately. We’ll use the App::Nopaste tool to paste a gist straight from the command line.

First off install the tool from cpan

$ cpan App::Nopaste

You can use this tool anonymously but if you want to keep an archive of your pastes you can simply setup your git credentials in your .gitconfig file. The token here is your API token which can be found under your account settings page.

[github]
        user = jgoulah
        token = 00000000000000000000000000000000

You also need either Git installed or Config::INI::Reader to allow the module to read your .gitconfig file.

Now to make this a bit easier to remember we can create an alias in our .bashrc file. In this case I’m specifying –private so that only people that I give this secure URL out to can see, and I’m also specifying to use the Gist service. The nopaste app supports a variety of other services that you can use but as of this writing Gist is the only one that supports the –private flag.

alias gist='nopaste --private --service Gist'

Now you can use the command to paste something such as a script from the command line and the gist URL is returned

$ gist somescript.sh
http://gist.github.com/df552bacbbbb8d70c089

There you have it!

Tags: , , , , ,


Mar 15 2010

Node.js, Websockets, and the Twitter Gardenhose

Category: Real-time Webjgoulah @ 3:08 PM

Introduction

In this article I’m going to demonstrate how to use the Twitter Streaming API to send a stream of status updates for real time display in a web browser using Web Sockets. We’ll implement a backend server module that streams the events over a web socket using the node.websocket.js framework, which provides a simple realtime HTTP server that implements the web socket protocol. Node.websocket.js is built on top of Node.js which provides an Asynchronous API using callbacks. Node.js is a server side javascript library that enforces an event driven style of programming which allows you to develop non-blocking code easily. This in turn allows you to write simple servers that are very CPU and Memory efficient because you don’t have multiple threads taking up shared resources. Node.js is implemented using Google’s V8 javascript engine and also the CommonJS specification which is defining a standard library for server side javascript.

Getting Started

First things first, you need node.js, which you can get from github. Install in the usual way

$ git clone git://github.com/ry/node.git
$ cd node
$ ./configure
$ make
$ sudo make install

Then we’ll need node.websocket.js which you can also get from github

$ git clone git://github.com/Guille/node.websocket.js.git

This is basically an experimental implementation of the web socket API. You can create simple server side modules and then a client side implementation that opens up a web socket to the server in which you can then exchange data in both directions. There are a few examples included that you can take a look at including a simple echo server and a chat server. Just fire up the server like so

$ node runserver.js

and then open up one of the html files in the test/websocket/ directory in your browser. One catch though, you’ll need a current browser such as Google Chrome. I’d recommend it anyway, as browsing the web with it does run a bit smoother.

You’ll also need curl, which is pretty common on any linux box these days and can be installed through your package manager.

Setting Up the Server

The way we’re going to implement the server is to use the curl command to pull from the twitter stream into a file. Twitter gives you a bunch of JSON objects back which you can then parse and display. We’ll use this file in a moment to send the data over the web socket to the browser. There is some documentation on what you can do with the API but we’ll keep it simple and search for any tweets with ‘nyc’ in them

$ curl -dtrack=nyc http://stream.twitter.com/1/statuses/filter.json  -uUSERNAME:PASSWORD > sample.json

Now its time to write some server side javascript. In the node.websocket.js checkout there is a modules/ directory. We’ll create a new module called gardenhose.js. So the full path is node.websocket.js/modules/gardenhose.js

In a nutshell this is waiting for the client to establish a connection, and then it creates a child process that tails our file from the curl command above. Anytime the file is written to the “output” listener is invoked, which runs our callback to parse the JSON into objects that we can then use to send a string back to the client with some readable information from the stream.

Lets break it down just a bit more in case you are not familiar with Node. First we are requiring the system and filesystem modules from Node. Now, node.websocket.js basically just uses the node API to implement a server in the websocket.js file. It looks at the request header to see which module to instantiate and then invokes your onData() method when a client sends over data.

Therefore the onData method is the one we need to implement in our module above. We’ll use the process object in Node to create a child process that emits an event called “output” each time the child sends data to stdout. So the addListener call sets up a callback that will be invoked when our file receives more data from the twitter stream. That data comes in the form of JSON objects, one per line. So we split on the lines to create an array of JSON objects, and loop through them. Each time through the loop we’re sending this data back to the client, which is the web browser.

Just make sure the file you pass in is the correct path to the file you are outputting to from the curl command in the monitor_file() function. Then to run the node.websocket.js server you can just invoke it like so

$ node runserver.js

However if you are running the server from another host, you may want to listen on more than just the default of localhost

$ node runserver.js --host=0.0.0.0

Setting Up the Client

The goal here is to get the realtime tweets pumping through our web browser so our client will just be a web page with a little bit of web socket javascript.

This is probably a bit more straightforward. We’re just implementing a few of the functions from the Websocket interface. First we’re instantiating the WebSocket class with the hostname and port that we’re running the server on. The gardenhose in the path is to tell our server that we want to run the gardenhose.js module that we wrote above. The onopen() function is invoked when the socket is opened, and we send over the word “start” which if you recall from above understands the client has connected and to start the child process that runs tail on our file. The onmessage() function is invoked anytime the server is sending data over the web socket, which is the information we want to show on the page, so we append it to the HTML of our hose div. If the server closes the socket then onclose() is invoked, and we display that on the page.

Conclusion

We’ve written a server side module using Node.js and the node.websocket.js framework that will send tweets over a web socket connection. We have also looked at the WebSocket API and learned how to implement some of the functions defined by its interface. Of course, you could use JQuery or your favorite javascript libraries to enhance how this looks to the user, but the basics are all here in how the communication of a real time display can work with web sockets.

References

Async I/O - http://en.wikipedia.org/wiki/Asynchronous_I/O

CommonJS - http://www.commonjs.org/ and http://wiki.commonjs.org/wiki/CommonJS

V8 - http://code.google.com/p/v8

Node.JS - http://nodejs.org

Node.Websocket.JS - http://github.com/guille/node.websocket.js

Twitter Streaming API (Gardenhose) - http://apiwiki.twitter.com/Streaming-API-Documentation

Websocket API - http://dev.w3.org/html5/websockets

Tags: , , , , , , , , , , ,


Feb 06 2010

Improve Git Emails

Category: Uncategorized, Version Controljgoulah @ 3:02 PM

This is just a quick tip on how to hack gitosis to allow you to have better subject lines in your emails. This assumes you are using gitosis, not gitolite which many people hosting their own git repositories are moving to because it supports per branch and granulated user permissions. If you are looking on a full article on how to install gitosis check out this post.

So grab gitosis

 git clone git://eagain.net/gitosis

Apply this patch to set an environment variable for the user that is doing a push

diff --git a/gitosis/serve.py b/gitosis/serve.py
index d83b1d8..a6a49a2 100644
--- a/gitosis/serve.py
+++ b/gitosis/serve.py
@@ -200,6 +200,7 @@ class Main(app.App):
             sys.exit(1)

         main_log.debug('Serving %s', newcmd)
+        os.putenv('USER', user)
         os.execvp('git', ['git', 'shell', '-c', newcmd])
         main_log.error('Cannot execute git-shell.')
         sys.exit(1)

Then install gitosis with that change

$ sudo ./setup.py install

Now, edit the /usr/local/bin/git-post-receive-email script to change the subject line to make more sense

--- /usr/local/bin/git-post-receive-email.old	2010-02-06 19:29:35.000000000 +0000
+++ /usr/local/bin/git-post-receive-email	2010-02-06 19:14:03.000000000 +0000
@@ -186,7 +186,7 @@
 	# Generate header
 	cat <<-EOF
 	To: $recipients
-	Subject: ${emailprefix} $refname_type, $short_refname, ${change_type}d. $describe
+	Subject: ${emailprefix} push from ${USER}- $refname_type, $short_refname, ${change_type}d.
 	X-Git-Refname: $refname
 	X-Git-Reftype: $refname_type
 	X-Git-Oldrev: $oldrev

If you haven’t setup email already make sure that you have the hook installed by making a symbolic link in /home/git/repositories/myrepo.git/hooks

ln -s /usr/local/bin/git-post-receive-email post-receive

And now your subject changes from this

[GIT] branch, master, updated. d61c9b93446a13093ef7f510a588c69a606ff762

to this

[GIT] push from jgoulah- branch, master, updated.

Now we know who is pushing code by glancing at the subject line. The username comes from the filenames in gitosis-admin/keydir/. Obviously you could style the subject line however you see fit.

Tags: ,


Jan 18 2010

Using Mongo and Map Reduce on Apache Access Logs

Category: Databases, Systemsjgoulah @ 9:32 PM

Introduction

With more and more traffic pouring into websites, it has become necessary to come up with creative ways to parse and analyze large data sets. One of the popular ways to do that lately is using MapReduce which is a framework used across distributed systems to help make sense of large data sets. There are lots of implementations of the map/reduce framework but an easy way to get started is by using MongoDB. MongoDB is a scalable and high performant document oriented database. It has replication, sharding, and mapreduce all built in which makes it easy to scale horizontally.

For this article we’ll look a common use case of map/reduce which is to help analyze your apache logs. Since there is no set schema in a document oriented database, its a good fit for log files since its fairly easy to import arbitrary data. We’ll look at getting the data into a format that mongo can import, and writing a map/reduce algorithm to make sense of some of the data.

Getting Setup

Installing Mongo

Mongo is easy to install with detailed documentation here. In a nutshell you can do

$ mkdir -p /data/db
$ curl -O http://downloads.mongodb.org/osx/mongodb-osx-i386-latest.tgz
$ tar xzf mongodb-osx-i386-latest.tgz

At this point its not a bad idea to put this directory somewhere like /opt and adding its bin directory to your path. That way instead of

 ./mongodb-xxxxxxx/bin/mongod &

You can just do

mongodb &

In any case, start up the daemon one of those two ways depending how you set it up.

Importing the Log Files

Apache access files can vary in the information reported. The log format is easy to change with the LogFormat directive which is documented here. In any case these logs that I’m working with are not out of the box apache format. They look something like this

Jan 18 17:20:26 web4 logger: networks-www-v2 84.58.8.36 [18/Jan/2010:17:20:26 -0500] “GET /javascript/2010-01-07-15-46-41/97fec578b695157cbccf12bfd647dcfa.js HTTP/1.1″ 200 33445 “http://www.channelfrederator.com/hangover/episode/HNG_20100101/cuddlesticks-cartoon-hangover-4″ “Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7″ www.channelfrederator.com 35119

We want to take this raw log and convert it to a JSON structure for importing into mongo, which I wrote a simple perl script to iterate through the log and parse it into sensible fields using a regular expression

#!/usr/bin/perl

use strict;
use warnings;

my $logfile = "/logs/httpd/remote_www_access_log";
open(LOGFH, "$logfile");
foreach my $logentry (<LOGFH>) {
    chomp($logentry);   # remove the newline
    $logentry =~ m/(\w+\s\w+\s\w+:\w+:\w+)\s #date
                   (\w+)\slogger:\s # server host
                   ([^\s]+)\s # vhost logger
                   (?:unknown,\s)?(-|(?:\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3},?\s?)*)\s #ip
                   \[(.*)\]\s # date again
                   \"(.*?)\"\s # request
                   (\d+)\s #status
                   ([\d-]+)\s # bytes sent
                   \"(.*?)\"\s # referrer
                   \"(.*?)\"\s # user agent
                   ([\w.-]+)\s? # domain name
                   (\d+)? # time to server (ms)
                  /x;

    print <<JSON;
     {"date": "$5", "webserver": "$2", "logger": "$3", "ip": "$4", "request": "$6", "status": "$7", "bytes_sent": "$8", "referrer": "$9", "user_agent": "$10", "domain_name": "$11", "time_to_serve": "$12"}
JSON

}

Again my regular expression probably won’t quite work on your logs, though you may be able to take bits and pieces of what I’ve documented above to use for your script. On my logs that script outputs a bunch of lines that look like this

{"date": "18/Jan/2010:17:20:26 -0500", "webserver": "web4", "logger": "networks-www-v2", "ip": "84.58.8.36", "request": "GET /javascript/2010-01-07-15-46-41/97fec578b695157cbccf12bfd647dcfa.js HTTP/1.1", "status": "200", "bytes_sent": "33445", "referrer": "http://www.channelfrederator.com/hangover/episode/HNG_20100101/cuddlesticks-cartoon-hangover-4", "user_agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7", "domain_name": "www.channelfrederator.com", "time_to_serve": "35119"}

And we can then import that directly to MongoDB. The creates a collection called weblogs in the logs database, from the file that has the output of the above JSON generator script

$ mongoimport --type json -d logs -c weblogs --file weblogs.json

We can also take a look at them and verify they loaded by running the find command, which dumps out 10 rows by default

$ mongo
> use logs;
switched to db logs
> db.weblogs.find()

Setting Up the Map and Reduce Functions

So for this example what I am looking for is how many hits are going to each domain. My servers handle a bunch of different domain names and that is one thing I’m outputting into my logs that is easy to examine

The basic command line interface to MondoDB is a kind of javascript interpreter, and the mongo backend takes javascript implementations of the map and reduce functions. We can type these directly into the mongo console. The map function must emit a key/value pair and in this example it will output a ‘1′ each time a domain is found

> map = "function() { emit(this.domain_name, {count: 1}); }"

And so basically what comes out of this is a key for each domain with a set of counts, something like

 {"www.something.com", [{count: 1}, {count: 1}, {count: 1}, {count: 1}]}

This is send to the reduce function, which sums up all those counts for each domain

> reduce = "function(key, values) { var sum = 0; values.forEach(function(f) { sum += f.count; }); return {count: sum}; };"

Now that you’ve setup map and reduce functions, you can call mapreduce on the collection

> results = db.weblogs.mapReduce(map, reduce)
...
>results
{
        "result" : "tmp.mr.mapreduce_1263861252_3",
        "timeMillis" : 9034,
        "counts" : {
                "input" : 1159355,
                "emit" : 1159355,
                "output" : 92
        },
        "ok" : 1,
}

This gives us a bit of information about the map/reduce operation itself. We see that we inputted a set of data and emmitted once per item in that set. And we reduced down to 92 domain with a count for each. A result collection is given, and we can print it out

> db.tmp.mr.mapreduce_1263861252_3.find()
> it

You can type the ‘it’ operator to page through multiple pages of data. Or you can print it all at once like so

> db.tmp.mr.mapreduce_1263861252_3.find().forEach( function(x) { print(tojson(x));});
{ "_id" : "barelydigital.com", "value" : { "count" : 342888 } }
{ "_id" : "barelypolitical.com", "value" : { "count" : 875217 } }
{ "_id" : "www.fastlanedaily.com", "value" : { "count" : 998360 } }
{ "_id" : "www.threadbanger.com", "value" : { "count" : 331937 } }

Nice, we have our data aggregated and the answer to our initial problem.

Automate With a Perl Script

As usual CPAN comes to the rescue with a MongoDB driver that we can use to interface with our database in a scripted fashion. The mongo guys have also done a great job of supporting it in a bunch of other languages which makes it easy to interface with if you are using more than just perl

#!/bin/perl

use MongoDB;
use Data::Dumper;
use strict;
use warnings;

my $conn = MongoDB::Connection->new("host" => "localhost", "port" => 27017);
my $db = $conn->get_database("logs");

my $map = "function() { emit(this.domain_name, {count: 1}); }";

my $reduce = "function(key, values) { var sum = 0; values.forEach(function(f) { sum += f.count; }); return {count: sum}; }";

my $idx = Tie::IxHash->new(mapreduce => 'weblogs', 'map' => $map, reduce => $reduce);
my $result = $db->run_command($idx);

print Dumper($result);

my $res_coll = $result->{'result'};

print "result collection is $res_coll\n";

my $collection = $db->get_collection($res_coll);
my $cursor = $collection->query({ }, { limit => 10 });

while (my $object = $cursor->next) {
    print Dumper($object);
}

The script is straightforward. Its just using the documented perl interface to make a run_command call to mongo, which passes the map and reduce javascript functions in. It then prints the results similar to how we did on the command line earlier.

Summary

We’ve gone over a very simple example of how MapReduce can work for you. There are lots of other ways you can put it to good use such as distributed sorting and searching, document clustering, and machine learning. We also took a look at MongoDB which has great uses for schema-less data. It makes it easy to scale since it has built in replication and sharding capabilities. Now you can put map/reduce to work on your logs and find a ton of information you couldn’t easily get before.

References

http://github.com/mongodb/mongo
http://www.mongodb.org/display/DOCS/MapReduce
http://apirate.posterous.com/visualizing-log-files-with-mongodb-mapreduce
http://search.cpan.org/~kristina/MongoDB-0.27/
http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/

Tags: , , , , , , , , ,


Dec 13 2009

Using Amazon Relational Database Service

Category: Cloud Computingjgoulah @ 2:09 PM

Intro

Amazon recently released a new service that makes it easier set up, operate, and scale a relational database in the cloud called Amazon Relational Database Service. The service, based on MySQL for now, has its pluses and minuses and you should decide whether it fits your needs. The advantages are that it has an automated backup system that lets you restore to any point within the last 5 minutes and also allows you to easily take a snapshot at any time. It is also very easy to “scale up” your box. This is more in the realm of vertical scaling but if you find you are hitting limits you can upgrade to a more powerful server with little to no effort. It also gives you monitoring via Amazon Cloudwatch and automatically patches your database during your self defined maintenance windows. The downfalls are that you don’t have access directly to the box itself, so you can’t ssh in. You also at this point cannot use replication for a master-slave style setup. Amazon promises to have more high availability options forthcoming. Since you can’t ssh in, you adjust mysql parameters via their db parameter group API. I’ll go over an example of this.

Creating a Database Instance

The first thing to do is install the RDS command line API, which you can grab from here. I’m not going over the details of setting this up. Its basically as simple as putting it into your path and Amazon has plenty of documentation on this.

Once you have the command line tools setup you can create a database like so

rds-create-db-instance \
        mydb-instance \
        --allocated-storage 20 \
        --db-instance-class db.m1.small \
        --engine MySQL5.1  \
        --master-username masteruser \
        --master-user-password mypass \
        --db-name mydb --headers \
        --preferred-maintenance-window 'Mon:06:15-Mon:10:15' \
        --preferred-backup-window  '10:15-12:15' \
        --backup-retention-period 1 \
        --availability-zone us-east-1c

This should be fairly self explanatory. I’m creating an instance called mydb-instance. The master (basically root) user is called masteruser with password mypass. It also creates an initial database called mydb. You can add more databases and permissions later. This also sets up the maintenance and backup windows which are required, and defined in UTC. The backup retention period is how long it holds on to my backups, which I’ve defined as 1 day. If you set this to 0 it will disable the automated backups entirely which is not advised.

The next thing to do is setup your security groups so that your EC2 (or your hosted servers) have access to your database. There is good documentation on this so I will go over a basic use case.

rds-authorize-db-security-group-ingress \
        default \
        --ec2-security-group-name webnode \
        --ec2-security-group-owner-id XXXXXXXXXXXX

In the case above I’m creating a security group called default, that allows my ec2 security group webnode access. The group-owner-id parameter is your AWS account id.

You can find what your database DNS name is via the rds-describe-db-instances command.

rds-describe-db-instances
DBINSTANCE  mydb-instance  2009-11-06T02:19:59.160Z  db.m1.small  mysql5.1  20  masteruser  available  mydb-instance.cvjb75qirgzk.us-east-1.rds.amazonaws.com  3306  us-east-1d  1
      SECGROUP  default  active
      PARAMGRP  default.mysql5.1  in-sync

So we can see our hostname is mydb-instance.cvjb75qirgzk.us-east-1.rds.amazonaws.com

Now you can login to your instance in the usual way that you access mysql on the command line, setup your users and import your database in the usual way.

mysql -u masteruser -h mydb-instance.cvjb75qirgzk.us-east-1.rds.amazonaws.com -pmypass

Using Parameter Groups to View the MySQL Slow Queries Log

As I mentioned earlier you don’t have access to ssh into the instance, so you need to use db parameter groups to tweak your configuration rather than editing the my.cnf file. You can’t see the mysql slow query log on the box but there is still a way to access it and I’ll go over that process.

Amazon won’t let you edit the default group, so the first thing to do is create a parameter group to define your custom parameters.

rds-create-db-parameter-group my-custom --description='My Custom DB Param Group' --engine=MySQL5.1

Then set the parameter to turn the query log on

rds-modify-db-parameter-group my-custom  --parameters="name=slow_query_log, value=ON, method=immediate"

We’re still using the default configuration so you have to tell the instance to use your custom parameter group

rds-modify-db-instance mydb-instance --db-parameter-group-name=my-custom

The first time you apply a new custom group you have to reboot the instance, as pending-reboot here indicates

$ rds-describe-db-instances
DBINSTANCE  mydb-instance  2009-11-06T02:19:59.160Z  db.m1.small  mysql5.1  20  masteruser  available  mydb-instance.cvjb75qirgzk.us-east-1.rds.amazonaws.com  3306  us-east-1d  1
      SECGROUP  default  active
      PARAMGRP  my-custom  pending-reboot

So we can reboot it immediately like so

$ rds-reboot-db-instance mydb-instance

When it comes back up it will show that its in-sync

$ rds-describe-db-instances
DBINSTANCE  mydb-instance  2009-11-06T02:19:59.160Z  db.m1.small  mysql5.1  20  masteruser  available  mydb-instance.cvjb75qirgzk.us-east-1.rds.amazonaws.com  3306  us-east-1d  1
      SECGROUP  default  active
      PARAMGRP  my-custom  in-sync

We can login to the instance and see that our parameter was set correctly

mysql> show global variables like 'log_slow_queries';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| log_slow_queries | ON    |
+------------------+-------+
1 row in set (0.00 sec)

Since you don’t have access to the filesystem, its logged to a table on the mysql database

mysql> use mysql;

mysql> describe slow_log;
+----------------+--------------+------+-----+-------------------+-----------------------------+
| Field          | Type         | Null | Key | Default           | Extra                       |
+----------------+--------------+------+-----+-------------------+-----------------------------+
| start_time     | timestamp    | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| user_host      | mediumtext   | NO   |     | NULL              |                             |
| query_time     | time         | NO   |     | NULL              |                             |
| lock_time      | time         | NO   |     | NULL              |                             |
| rows_sent      | int(11)      | NO   |     | NULL              |                             |
| rows_examined  | int(11)      | NO   |     | NULL              |                             |
| db             | varchar(512) | NO   |     | NULL              |                             |
| last_insert_id | int(11)      | NO   |     | NULL              |                             |
| insert_id      | int(11)      | NO   |     | NULL              |                             |
| server_id      | int(11)      | NO   |     | NULL              |                             |
| sql_text       | mediumtext   | NO   |     | NULL              |                             |
+----------------+--------------+------+-----+-------------------+-----------------------------+
11 rows in set (0.11 sec)

We may also want to set things like the slow query time, since the default of 10 is pretty high

$ rds-modify-db-parameter-group my-custom  --parameters="name=long_query_time, value=3, method=immediate"

The rds-describe-events command keeps a log of what you’ve been doing

$ rds-describe-events
db-instance         2009-12-12T17:44:19.546Z  mydb-instance  Updated to use a DBParameterGroup my-custom
db-instance         2009-12-12T17:45:51.636Z  mydb-instance  Database instance shutdown
db-instance         2009-12-12T17:46:09.380Z  mydb-instance  Database instance restarted
db-parameter-group  2009-12-12T17:56:02.568Z  my-custom        Updated parameter long_query_time to 3 with apply method immediate

And again you can check mysql that your parameter was edited properly. Note how this time we didn’t have to reboot anything as our parameter group is already active on this instance

mysql> show global variables like 'long_query_time';
+-----------------+----------+
| Variable_name   | Value    |
+-----------------+----------+
| long_query_time | 3.000000 |
+-----------------+----------+
1 row in set (0.00 sec)

Summary

In this article we went over some basics of Amazon RDS and why you may or may not want to use it. If you are just starting out its a really easy way to get a working mysql setup going. However if you are porting from an architecture with multiple slaves or other HA options this may not be for you just yet. We also went over some basic use cases on how to tweak parameters since there is no command line access to the box itself. There is command line access to mysql though, so you can use your favorite tools there.

References

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2935&categoryID=295

Tags: , , , ,


Nov 01 2009

Migrating SVN to Git

Category: Version Controljgoulah @ 5:55 PM

Overview

SVN migrations to Git can vary in complexity, depending on how old the repository is and how many branches were created and merged, as well as if you use regular SVN or a facade such as SVK to do your branching and merging. If you have a fairly new repository and the standard setup of a trunk, branches, and tags directory, your job could be pretty easy. However if you’ve done a ton of branching and merging, or your repository follows a non-standard directory setup, or that setup changed over time, you could have a bit more work on your hands.

SVN to Git Conversion

Building Git

The very first thing to do is make sure you are running the newest git. And for this I recommend from source. So grab it and do the usual compile stuff

$ wget http://kernel.org/pub/software/scm/git/git-1.6.5.2.tar.bz2
$ tar xjf git-1.6.5.2.tar.bz2
$ cd git-1.6.5.2
$ ./configure
$ make

Now I’m not sure if this is because I am always using local::lib but I tend to get this error after a while of building. If you don’t get it you can obviously skip this step

Only one of PREFIX or INSTALL_BASE can be given

This is fixable by going into the git/perl directory and manually building the perl modules

$ cd perl/
$ perl Makefile.PL

Then go back up a directory and finish the build

$ cd ..
$ make
$ sudo make install

Using git-svn

If you have made any merges using SVK then you should grab the git-svn from Sam Vilain’s git branch here. I had to download it because the git clone wasn’t working so grab it like so

$ wget http://download.github.com/samv-git-ccef8e0.tar.gz

Open it up, and copy the git-svn.perl script into your bin directory

$ tar xzvf samv-git-ccef8e0.tar.gz
$ cd samv-git-ccef8e0
$ cp git-svn.perl ~/bin/git-svn

Now you can git svn clone your repository with the git-svn command. The –prefix=svn/ is necessary because otherwise the tools can’t tell apart SVN revisions from imported ones. If you are using the standard trunk, branches, tags layout you’ll just put –stdlayout like we have below. However if you had something different you may have to pass the –trunk, –branches, and –tags in to identify what is what. For example if your repository structure was trunk/companydir and you branched that instead of trunk, you would probably want ‘–trunk=trunk/companydir –branches=branches‘. You can optionally provide an authors file, and the last option to our script is the repository URL.

$ git-svn clone --prefix=svn/ --stdlayout --authors-file=authors.txt http://example.com/svn

This can take a few minutes or as long as a few hours depending how big your repository is. When its done you should end up with a git checkout of your repository. If you look at your remote branches you’ll see that its ported our svn stuff with the svn prefix we used

$ git branch -r
  svn/dbic_versioning
  svn/feed_source
  svn/trunk

Anyway at this point it doesn’t hurt to create a tarball of your project so you can get back to it later if you screw up.

Fixing the Git Repository

There are a few more scripts you need to grab to finish this off. We want to clone the git-svn-abandon repository for these.

$ git clone git://github.com/nothingmuch/git-svn-abandon.git

Now go into that directory and copy the new scripts into your local bin directory, or wherever you feel is good for your purposes

$ cd git-svn-abandon
$ cp git-svn-abandon-* ~/bin/

You can now run git-svn-abandon-fix-refs, which will run through all the imported refs, recreating properly dated annotated tags, and makes branches out of everything else. It’ll also rename trunk to master.

$ git-svn-abandon-fix-refs

We can see that it took the things that our git branch -r command earlier listed and imported them into actual git branches, as well as putting us on the master branch which is analogous to the trunk from svn

$ git branch
  dbic_versioning
  feed_source
* master

Grafting Your Branches

Depending how correct you want your repository, you could skip this step. However it could be a good idea to try to get your merge history correct. This requires a little bit of understanding of the git internals, so at a minimum you should read and understand this introduction. There is also a nice presentation of how git works here that you may consider watching.

So now you understand (you went over those links right?) that the way git figures out merges is such that it basically creates a new node on the graph that points to both the head of master and of the branch. And then it basically takes the file contents from the 3 points, the new merge base node (the ancestor), and each parent, deltafies them and applies the result to the ancestor. Obviously if they don’t apply cleanly a conflict is created and must be resolved. The previous revisions contents is no longer relevant at all, except as a part of history, and our new snapshot of the history is born. So this is basically what we have to fix.

This is a snapshot from gitk of my repository. We can see near the bottom that it branched properly, but even though the repository was merged in svn, it doesn’t appear to be merged in our migration to git.

The way to fix this is to edit the .git/info/grafts file. With my great photoshop skills I’m going to point out what goes in there

We need to take what the first arrow is pointing to, and tell git that its parents are its current parent (the third arrow) as well as the point the branch was merged in (the second arrow). So all we have to do is add the hash from each of those commits, put the top node first, then the second and third with a space in between each to the .git/info/grafts file.

After we add those to the .git/info/grafts file we can just reload gitk or gitx to see those changes that visually show the branch was merged

When you are done you can run git-svn-abandon-cleanup which cleans up SVK style merge commit messages and removes git-svn-id strings. Another important thing that happens is the grafts entries are incorporated into the filtered commits, so the extra merge metadata becomes clonable

$ git svn-abandon-cleanup

Publish Your Repository

If you followed the last article you’ve got a gitosis setup that we can use to store our code centrally. So depending what host this is on you can import it like so

$ git remote add origin git@localhost:repository.git
$ git push --all
$ git push --tags

And we’re done.

Conclusion

We’ve gone over how to migrate an SVN repository to Git and how to deal with some of the complications of how Git interprets our repository based on what was in SVN and how to go about fixing that. Hopefully we also learned a little bit about Git too in the process.

References

Tags: , , ,


Oct 31 2009

Setting up Gitosis

Category: Version Controljgoulah @ 4:33 PM

Overview

This article is part one of a two part series that covers setting up a hosting server using gitosis for your central repository, and in the next article, taking an existing SVN repository and running the appropriate scripts and commands necessary to migrate it into something git can work with.

So this article is how to setup and manage a git repository.There are some great services out there than can do this for you, but why pay money for something you can easily do for free? This article shows how to setup and manage a secure and private git repository that people can use as a central sharing point.

Setting Up Gitosis

Gitosis is a tool for hosting git repositories. Its common usage is for a central repository that other developers can push changes to for sharing.

First clone the gitosis repository and run the basic python install. You just need the python setuptools package

sudo apt-get install python-setuptools

And then you can easily install it

git clone git://eagain.net/gitosis.git
cd gitosis
sudo python setup.py install

Next you need to create a user that will own the repositories you want to manage. You can put its home directory wherever you want, but in this example we’ll put it in the standard /home location.

sudo adduser \
    --system \
    --shell /bin/sh \
    --gecos 'git version control' \
    --group \
    --disabled-password \
    --home /home/git \
    git

Then you must create an ssh public key (or use your existing one) for your first repository user. We’ll use an init command to copy it to server and load it. If you don’t have a public key you can create one with ssh-keygen like so

ssh-keygen -t dsa

Then gitosis-init is for the first time only, loads up your users key, and goes like this

sudo -H -u git gitosis-init < ~/.ssh/id_dsa.pub

Here it doesn’t hurt to make sure your post-update hook has execute permissions.

sudo chmod 755 /home/git/repositories/gitosis-admin.git/hooks/post-update

Now you can clone the gitosis-admin repository, which is used to manage our repository permissions.

git clone git@YOUR_SERVER_HOSTNAME:gitosis-admin.git
cd gitosis-admin

Now you can see you have a gitosis.conf file and a keydir directory

$ ls -l
total 8
-rw-r--r-- 1 jgoulah mygroup   83 2009-10-31 20:44 gitosis.conf
drwxr-xr-x 2 jgoulah mygroup 4096 2009-10-31 20:44 keydir

The gitosis.conf file holds group and permission information for your repositories, and the keydir folder holds your public keys.

If I look in there I see my public key was imported from our earlier gitosis-init command

$ ls -l keydir/
total 4
-rw-r--r-- 1 jgoulah mygroup 603 2009-10-31 20:44 jgoulah.pub

So open up gitosis.conf and you should already see you have an entry for the gitosis-admin repository that we just cloned. The gitosis-init command above setup the access for us. From now on we can just crack open gitosis.conf and edit the permissions, commit and push back to our central repository.

If I wanted to create a new project for a repository called pizza_maker it would look something like this.

[group myteam]
members = jgoulah
writable = pizza_maker

Don’t forget the members section is the name of your public key file without the .pub at the end. If your key was named XYZ.pub then your member line would have XYC here.

git commit -a -m "Create new repo permissions for pizza_maker project"
git push

As a reminder the second part of this series will show an svn to git import. For now lets assume we are starting from scratch. We’d create our project like this

cd && mkdir pizza_maker
cd pizza_maker
git init
git remote add origin git@YOUR_SERVER_HOSTNAME:pizza_maker.git
git add *
git commit -m "some stuff"
git push origin master:refs/heads/master

The only other thing to know is if you want to grant another user access to your repository. All you have to do is add their public key to the keydir folder, and then give the user permissions by modifying gitosis.conf

cd gitosis-admin
cp ~/otherdude.pub keydir/
 [group myteam]
- members = jgoulah
+ members = jgoulah otherdude
  writable = pizza_maker

If you need to, you can also grant public access over the git:// protocol like so

sudo -u git git-daemon --base-path=/home/git/repositories/ --export-all

Then someone can clone like

git clone git://YOUR_SERVER_HOSTNAME/pizza_maker.git

Conclusion

This article showed how to setup gitosis, how to initialize your gitosis-admin repository, which is a unique concept in itself to use a repository to manage repositories, and it works rather well. We also went over how to create our own new git repository, and how to manage the access permissions through gitosis.conf. Part two of this series will explain how to port from your current SVN setup to a Git setup. This article was a prerequisite if you want to host your own private repository when you’re converting from SVN to Git, and thats what we’ll look at next time.

Tags: , , , , , ,


Sep 07 2009

Deploying A Catalyst App on Private Hosting

Category: Deploymentjgoulah @ 1:26 PM

Intro

In the last post I wrote about deploying Catalyst on shared hosting.  While shared hosting may seem attractive pricewise, you’ll quickly grow out of it, and it makes sense to move to hosting with some more control if you are serious about your website.   There are lots of choices when it comes to where you may want to host,  and if you are looking for virtual options, Slicehost, Linode, prgmr, or even Amazon EC2 are great choices.  Picking and setting these up are well beyond the scope of this article,  so I’m assuming you have a server with root privileges ready to go.

Setting Up Your Modules

As I probably sound like a broken record if you’ve read any of my other perl-related entries,  its a good idea to setup your modules using local::lib.  Check out the last article on how to set it up,  or the module docs do a fine job if you follow the bootstrap instructions.  I always create a new user to set this up as,  such as ‘perluser’  or similar,  this way your modules aren’t affected by upgrades as your own user.  We’ll show how to point to these modules shortly.

For now, you’ll want to also make sure  you have FCGI and FCGI::ProcManager installed as that user that you just setup along with the rest of your apps modules.

$ cpan FCGI
$ cpan FCGI::ProcManager

You may want to checkout your app under this user, so that you can just do

$ perl Makefile.PL
$ make installdeps

assuming you’ve kept your Makefile.PL updated, this should be all of the perl modules you need to run your application.

Setting Up Apache

You can setup apache using your package manager of choice.  For ease of this article I’ll assume you’re on Ubuntu, or Debian based system, and you can do something like

$ sudo apt-get install apache2.2-common
$ sudo apt-get install apache2-mpm-worker

And you also need the mod_fastcgi module.  You could download and compile it.  Or you could grab it from apt.   You’ll probably have to add the multiverse repository,  so update your /etc/apt/sources.listso that each line will end with

main restricted universe multiverse

Now you can

$ sudo apt-get update
$ sudo apt-get install libapache2-mod-fastcgi

Setup the VirtualHost and App Directory Structure

You need to put a VirtualHost so that Apache knows how to handle the requests to your domain

FastCgiExternalServer /opt/mysite.com/fake/prod -socket /opt/mysite.com/run/prod.socket

<VirtualHost *:80>
ServerName www.mysite.com

DocumentRoot /opt/mysite.com/app
Alias /static /opt/mysite.com/app/root/static/
Alias / /opt/mysite.com/fake/prod/
</VirtualHost>

This “/” alias ties your document root to the listening socket defined in the FastCgiExternalServer line. The “/static” alias make sure your static files are served by apache, instead of fastcgi.

This config also assumes some directory structure is setup,  which is really entirely up to you.  But here we’ll assume you have a directory located at /opt/mysite.com with a few directories under that called fake, run, and app.

$ sudo mkdir -p /opt/mysite.com/{fake,run,app}

The only directory you have to put anything in is app, which should contain your code.

Note, this is a very simplified layout.  In the real world I’d put the fake, run, and app dirs under a versioned directory,  which my active virtualhost would then point to.  I’ve talked briefly about this kind of deployment technique before at a high level and there is a great writeup here on the technical details on how to use multiple fastcgi servers to host your staging and production apps with seamless deployment.  Part of the beauty of using FastCGI is that you can run two copies of the app against the same socket so its easy to bring up instances pointing to different versions of your code, and deploy with zero downtime.

Launching FastCGI

The last piece of the puzzle is to have a launch script,  which makes sure that your app is listening on the socket.  So to keep it simple you would have a script called launch_mysite.sh that looks like this

#!/bin/sh

export PERL5OPT='-Mlib=/home/perl_module_user/perl5/lib/perl5'

/opt/mysite.com/app/script/mysite_fastcgi.pl \
-l /opt/mysite.com/run/prod.socket -e \
-p  /opt/mysite.com/run/prod.pid -n 15

The first line is telling the script to use the modules from the user we setup the local::lib to hold the modules, so make sure you change this to the correct location.  Then it starts up fastcgi to listen on your socket, and create a process pid,  and to spawn in this case 15 processes to handle requests. Go ahead and hit your domain, and it should show you your website.

Conclusion

We’ve gone over the basics on how to setup FastCGI using FastCgiExternalServer. You now have a lot of flexibility in how many processes are handling requests, the ability to run different copies of your app and flipping the switch, and pointing to which modules are run with your app. There are a lot of improvements you can now make to setup a very sane deployment process so that each version of code deployed can be its own standalone build and ensuring your production app has 100% uptime, but at this point its up to your imagination.

Tags: , , , , ,


Aug 29 2009

Deploying a Catalyst App on Shared Hosting

Category: Deploymentjgoulah @ 1:48 PM

Intro

People have long complained that one of the tricky things about perl is the deployment phase, much because of the intricacies of mod_perl and its unfriendliness towards shared environments. In fact, I would highly recommend FastCGI over mod_perl since it is quite easy to understand and configure. This post is going to focus on smaller shared hosting environments, and how easy it can be to quickly deploy a Catalyst web app.

Getting Started

Assumptions

There are some basic assumptions here. First we need a Linux webserver that has Apache installed and is loading up
fcgid. I believe you can also use the favored mod_fastcgi which I just pointed to above, but I have yet to test this on a shared host. These are binary compatible modules so in theory both work. But again I’ve only used mod_fastcgi for large non-shared hosted deployments. You’ll also need mod_rewrite which is now fairly common.

Installing Your Modules

The best way to install the modules your application depends on is using local::lib. I’ve talked about this before so there isn’t a lot of need to go over the process in detail again, but in a nutshell you can do

$ wget http://search.cpan.org/CPAN/authors/id/A/AP/APEIRON/local-lib-1.004006.tar.gz
$ tar xzvf local-lib-1.004006.tar.gz
$ cd local-lib-1.004006
$ perl Makefile.PL --bootstrap
$ make test && make install
$ echo 'eval $(perl -I$HOME/perl5/lib/perl5 -Mlocal::lib)' >>~/.bashrc
$ source ~/.bashrc

Now you have an environment that you can install your modules into. By default this is localized to ~/perl5. The next step is to install your modules that the application requires. It is good practice to put these into your Makefile.PL so that you can easily install them in one shot. A very basic one would follow this template

use inc::Module::Install;

name 'MyApp';
all_from 'lib/MyApp.pm';

requires 'Catalyst::Runtime' => '5.80011';
requires 'Config::General';
# require other modules here

install_script glob('script/*.pl');
auto_install;
WriteAll;

Now its easy to do

$ export PERL_MM_USE_DEFAULT=1
$ perl Makefile.PL
$ make installdeps

The PERL_MM_USE_DEFAULT will configure things such that you don’t have to press enter at every question about a dependency. The make installdeps will install any missing modules, which in this case is going to be everything. You can upgrade the version numbers in the Makefile.PL “requires” lines if you want installdeps to grab the newer distributions as they are released to CPAN.

Configuring Your App

First thing we have to do is a minor edit to our fastcgi script, which is to tell it to use our local::lib. Since its not part of the environment we setup earlier in .bashrc we have to tell the fastcgi perl script where to find things. Below the “use warnings;” line add this

use lib "/home/myuser/perl5/lib/perl5";
use local::lib;

Make sure to change the path to the correct location of your perl5 modules directory.

The last thing is to make sure your app is located in the public directory root for your host. In my case I created a symbolic link from the public_html folder to my app.

$ cd && mv public_html public_html.old  # get rid of current root folder
$ ln -s ~/myapp ~/public_html

Then create a .htaccess file in that folder, which should reside beside all of your code

Options ExecCGI FollowSymLinks
Order allow,deny
Allow from all
AddHandler fcgid-script .pl

RewriteEngine On
RewriteCond %{REQUEST_URI} !^/?script/myapp_fastcgi.pl
RewriteRule ^(.*)$ script/myapp_fastcgi.pl/$1 [PT,L]
RewriteRule ^static/(.*)$ static/$1 [PT,L]

We’re just telling apache to turn CGI on, and make sure to execute our fastcgi perl script. Be sure to change the rewrite lines to point to your script (hint: change myapp to your app name).

Now, you should be able to hit your domain. Simple!

Conclusion

So we’ve seen that deploying perl can actually be fairly easy. There are of course some assumptions here, for example, to get the rewrite rules working you’ll need mod_rewrite.so, but this is fairly standard these days. Now you can deploy a perl app with the same ease as languages such as PHP, where it is pretty much plug and play. This should enable people to more easily compete with all the badly written blog, forum, and other generically useful software in the open source world.

Tags: , , ,


Jun 11 2009

Investigating Data in Memcached

Category: Databases, Systemsjgoulah @ 9:58 PM

Intro

Almost every company I can think of uses Memcached at some layer of their stack, however I haven’t until recently found a great way to take a snapshot of the keys in memory and their associated metadata such as expire time, LRU time, the value size and whether its been expired or flushed. The tool to do this is called peep

Installing Peep

The main thing about installing peep is that you have to compile memcached with debugging symbols. Not really a big deal though. First thing you’ll want to do is install libevent

$ wget http://monkey.org/~provos/libevent-1.4.11-stable.tar.gz
$ tar xzvf libevent-1.4.11-stable.tar.gz
$ cd libevent-1.4.11-stable
$ ./configure
$ make
$ sudo make install

Now grab memcached

$ wget http://memcached.googlecode.com/files/memcached-1.2.8.tar.gz
$ tar xzvf memcached-1.2.8.tar.gz
$ cd memcached-1.2.8
$ CFLAGS='-g' ./configure --enable-threads
$ make
$ sudo make install

Note the configure line on this one sets the -g flag which enables debug symbols. Now move this directory somewhere standard like /usr/local/src because we’ll need to reference it when installing peep

$ sudo mv memcached-1.2.8 /usr/local/src

Ok, now for peep, which is written in ruby. If you don’t have rubygems you need that and the ruby headers. Just use your package management system for this. I happened to be on a CentOS box so I ran

$ sudo yum install rubygems.noarch
$ sudo yum install ruby-devel.i386

Finally you can install peep

$ sudo gem install peep -- --with-memcached-include=/usr/local/bin/memcached-1.2.8

Using Peep

Finally you can actually use this thing, you just give it the running memcached process pid

$ sudo peep --pretty `cat /var/run/memcached/memcached.pid`

Here’s a snippet of what peep can show us in its “pretty” mode

      time |   exptime |  nbytes | nsuffix | it_f | clsid | nkey |                           key | exprd | flushd
       485 |       785 |   17925 |      10 | link |    25 |   64 |  "post.ordered-posts1-03afdb" |  true |  false
       537 |       721 |   16991 |      10 | link |    24 |   63 |  "post.ordered-posts1-03bd6"  |  true |  false
       240 |      3684 |     434 |       8 | link |     9 |   22 |  "channel.channel.105.v1"     | false |  false
       241 |      3687 |   27286 |      10 | link |    27 |   55 |  "post.post_count-fec35129"   | false |  false
       538 |      4022 |    3223 |       9 | link |    17 |   55 |  "post.post_count-2ff57a7"    | false |  false
       538 |      4024 |   17169 |      10 | link |    25 |   55 |  "post.post_count-2ba928998d" | false |  false
       241 |      3686 |   10763 |      10 | link |    22 |   55 |  "post.post_count-3879a24011" | false |  false
        25 |       320 |    8874 |       9 | link |    22 |   22 |  "channel.posterframes.4"     |  true |  false

Putting the Data Into MySQL

The above is not the easiest thing to analyze, especially if you have a lot of data in cache. But we can easily load it in to MySQL so that we can run queries on it.

First create the db and permissions

mysql> create database peep;
mysql> grant all on peep.* to peep@'localhost' identified by 'peep4u';

Then a table to store the output of the peep snapshot

mysql> CREATE TABLE `entries` (
  `lru_time` int(11) DEFAULT NULL,
  `expires_at_time` int(11) DEFAULT NULL,
  `value_size` int(11) DEFAULT NULL,
  `suffix_size` int(11) DEFAULT NULL,
  `it_flag` varchar(255) DEFAULT NULL,
  `class_id` int(11) DEFAULT NULL,
  `key_size` int(11) DEFAULT NULL,
  `key_name` varchar(255) DEFAULT NULL,
  `is_expired` varchar(255) DEFAULT NULL,
  `is_flushed` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Now run peep and output the data in “ugly” format into a file

$ sudo peep --ugly `cat /var/run/memcached/memcached.pid` > peep.out

And now you can load it into your entries table

mysql> load data local infile '/tmp/peep.out' into table entries fields terminated by ' | ' lines terminated by '\n';
Query OK, 177 rows affected (0.01 sec)
Records: 177  Deleted: 0  Skipped: 0  Warnings: 0

I’m only loading a small dev install of memcahed but if you are running this on production you’d be importing many more rows. Luckily load data infile is sufficiently optimized to input large datasets. Somewhat unfortunately however, peep makes memcached block while its taking its snapshot, which can take a bit of time in production.

In any case, not that interesting with dev data but you can get lots of interesting numbers out of this. In my case it looks like most of the stuff in my cache is expired already.

mysql> select is_expired, count(*) as num, sum(value_size) as value_size from entries group by is_expired;
+------------+-----+------------+
| is_expired | num | value_size |
+------------+-----+------------+
| false      |   1 |        415 |
| true       | 176 |    2392792 |
+------------+-----+------------+
2 rows in set (0.01 sec)

Or selecting by grouping slabs and displaying their sizes

mysql> select class_id as slab_class, max(value_size) as slab_size from entries group by slab_class;
+------------+-----------+
| slab_class | slab_size |
+------------+-----------+
|          1 |         8 |
|          2 |        15 |
|          9 |       445 |
|         12 |      1100 |
|         13 |      1402 |
|         14 |      1710 |
|         15 |      2174 |
|         16 |      2409 |
|         17 |      3223 |
|         20 |      6154 |
|         21 |      8395 |
|         22 |     10763 |
|         23 |     13395 |
|         24 |     16991 |
|         25 |     17925 |
|         26 |     23320 |
|         27 |     33361 |
|         28 |     38824 |
|         29 |     44904 |
+------------+-----------+
19 rows in set (0.00 sec)

Conclusion

Although there are a lot of tools such as Cacti that will display graphs of your Memcached usage, there are not many tools that will actually show you specifics of whats in memory. This tool can be used for a variety of reasons, but its especially useful to example the keys that you have in memory and the size of their values, as well as if they’re expired and what the individuals slabs are holding.

Tags: , , , ,


Next Page »