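Monday, February 24, 2014

What's Docker? (Running web servers in multiple containers with docker + nginx)

In this post I use docker to run several web servers on one host.
Because the containers are published on several different external ports, I use nginx on the host as a proxy in front of them.

I will start two web server containers: one runs WordPress on Apache, the other runs plain nginx.

The containers are assigned the following domains:

memorycraft.cloudpack.jp
tenkaippin.cloudpack.jp

Each container also runs ssh.

The images are built from Dockerfiles.
The Dockerfile and the supervisor configuration template for each container are placed on the host as follows.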

 $ tree .
.
└── templates
    ├── memorycraft
    │   └── conf
    │       ├── Dockerfile
    │       └── supervisor.conf
    └── tenkaippin
        └── conf
            ├── Dockerfile
            └── supervisor.conf

Configuring the WordPress container

The memorycraft container runs the httpd and ssh services.
It also downloads WordPress and rewrites its configuration file so that it connects to RDS.

./templates/memorycraft/conf/Dockerfile

./templates/memorycraft/conf/supervisor.conf

Build the image.

docker build --no-cache --rm -t memorycraft/wordpress templates/memorycraft/conf/

Once the build finishes, start the container.

docker run -p 80 -p 22 -d memorycraft/wordpress /usr/bin/supervisord

Configuring the nginx container

The tenkaippin container runs the nginx and ssh services.

./templates/tenkaippin/conf/Dockerfile
./templates/tenkaippin/conf/supervisor.conf

Build the image.

docker build --no-cache --rm -t tenkaippin/nginx templates/tenkaippin/conf/

Once the build finishes, start the container.

docker run -p 80 -p 22 -d tenkaippin/nginx /usr/bin/supervisord

Configuring nginx on the host

First, install nginx on the host to act as the proxy.

# rpm -ivh http://nginx.org/packages/centos/6/noarch/RPMS/nginx-release-centos-6-0.el6.ngx.noarch.rpm
# yum install nginx -y

Now check the status of the containers.

# docker ps -a
CONTAINER ID        IMAGE                          COMMAND                CREATED             STATUS              PORTS                                          NAMES
5499f267f670        tenkaippin/nginx:latest        /usr/bin/supervisord   About an hour ago   Up About an hour    0.0.0.0:49185->22/tcp, 0.0.0.0:49186->80/tcp   determined_shockley
9b361ac4aaef        memorycraft/wordpress:latest   /usr/bin/supervisord   5 hours ago         Up 5 hours          0.0.0.0:49173->22/tcp, 0.0.0.0:49174->80/tcp   loving_nobel

From this we can see the following mappings:

memorycraft:80 → 49174
tenkaippin:80 → 49186

(You can also specify the host-side IP when starting a container.)
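Reading the mapping off the docker ps output by eye works for two containers, but it can also be extracted mechanically. The following is a minimal sketch; host_port_for is a hypothetical helper name, and it assumes the PORTS column has the format shown in the output above.

```shell
#!/bin/sh
# host_port_for PORTS CONTAINER_PORT
# PORTS is the text of the PORTS column of `docker ps`, for example
# "0.0.0.0:49185->22/tcp, 0.0.0.0:49186->80/tcp".
# Prints the host port mapped to CONTAINER_PORT (hypothetical helper).
host_port_for() {
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -n "s/^ *[0-9.]*:\([0-9][0-9]*\)->$2\/tcp *\$/\1/p"
}

# Values copied from the docker ps output above.
host_port_for "0.0.0.0:49173->22/tcp, 0.0.0.0:49174->80/tcp" 80   # memorycraft
host_port_for "0.0.0.0:49185->22/tcp, 0.0.0.0:49186->80/tcp" 80   # tenkaippin
```

The docker port command reports the same mapping for a single container and port, which may be simpler when the container ID is known.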

Next, set up the proxy for each virtual server. Start with the basic proxy settings.

/etc/nginx/conf.d/proxy.conf

Then define the virtual servers for memorycraft.cloudpack.jp and tenkaippin.cloudpack.jp.
As the proxy destination, specify the host ports looked up above.

/etc/nginx/conf.d/virtual.conf
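The contents of virtual.conf are not reproduced here. As a rough sketch of what such a file could look like, the helper below prints one nginx server block per domain. print_vhost is a hypothetical name, and the minimal server block (listen, server_name, proxy_pass only) is an assumption, not the original file.

```shell
#!/bin/sh
# print_vhost DOMAIN HOST_PORT
# Prints an nginx server block that proxies DOMAIN to a container
# published on 127.0.0.1:HOST_PORT. Hypothetical helper; the layout of
# the original virtual.conf is not known.
print_vhost() {
    cat <<EOF
server {
    listen 80;
    server_name $1;

    location / {
        proxy_pass http://127.0.0.1:$2;
    }
}
EOF
}

# Host ports taken from the docker ps output earlier in this post.
print_vhost memorycraft.cloudpack.jp 49174
print_vhost tenkaippin.cloudpack.jp 49186
```

Redirecting this output to /etc/nginx/conf.d/virtual.conf and reloading nginx would route each domain to its container. Because the host ports are assigned dynamically, the file has to be regenerated whenever a container is restarted with a different port.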

DNS settings

Next, in Route53, register A records for the two subdomains pointing to this server.

That completes the setup.

Verification

Check SSH first.

memorycraft.cloudpack.jp

First, connect over ssh from the host server.

# ssh memorycraft@127.0.0.1 -p 49173
The authenticity of host '[127.0.0.1]:49173 ([127.0.0.1]:49173)' can't be established.
RSA key fingerprint is a2:c9:81:fb:f5:84:57:ee:06:db:8b:18:7e:3c:2a:2e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[127.0.0.1]:49173' (RSA) to the list of known hosts.
memorycraft@127.0.0.1's password:
[memorycraft@9b361ac4aaef ~]$
[memorycraft@9b361ac4aaef ~]$

Connected.

Next, check in the browser.

It works!

tenkaippin.cloudpack.jp

Connect over ssh.

# ssh tenkaippin@127.0.0.1 -p 49175
ssh: connect to host 127.0.0.1 port 49175: Connection refused
[root@ip-10-157-38-165 ~]# ssh tenkaippin@127.0.0.1 -p 49185
The authenticity of host '[127.0.0.1]:49185 ([127.0.0.1]:49185)' can't be established.
RSA key fingerprint is c3:87:55:75:0b:d4:ce:f3:5c:0a:e9:71:e1:0f:fd:ca.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[127.0.0.1]:49185' (RSA) to the list of known hosts.
tenkaippin@127.0.0.1's password:
[tenkaippin@5499f267f670 ~]$
[tenkaippin@5499f267f670 ~]$

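Connected.

Now the browser.

The page is displayed correctly!

As shown here, running multiple containers can be handy when server resources are limited and you want to keep user spaces strictly separated.

That's all.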

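Tuesday, February 18, 2014

What's Docker? (S3-backed private registry)

This continues the previous post on private registries.

With the previous setup, if the server hosting the registry container goes down, every registered image is lost.
The docker registry container has a configuration file, and one of its persistence options is to use S3 as the storage backend.

Let's try it.

Registry-side configuration

Start the registry container with bash.

$ docker run -t -i registry /bin/bash

Inside the container, move to /docker-registry/config/, where the configuration files live.
A sample for S3 is provided, so use it as config.yml.

In the registry configuration file, values written as _env:VARIABLENAME are read from environment variables. You can pass those variables to the container with the -e option at startup.

The S3 sample takes the AWS keys, bucket name and so on from environment variables, so they can be supplied as parameters at startup. I use the file as is, without changes.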

# cd /docker-registry/config
# mv config.yml config.yml.org
# cp config_s3.yml config.yml
# cat config.yml
~(略)~
prod:
    storage: s3
    boto_bucket: _env:AWS_BUCKET
    s3_access_key: _env:AWS_KEY
    s3_secret_key: _env:AWS_SECRET
    s3_bucket: _env:AWS_BUCKET
    s3_encrypt: true
    s3_secure: true
    secret_key: REPLACEME
    s3_encrypt: true
    s3_secure: true
    storage_path: /images
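
The _env:NAME entries in the file above are placeholders that are filled in from the container's environment. The following is only an illustration of that lookup rule, not the registry's own code; resolve_setting is a hypothetical name.

```shell
#!/bin/sh
# resolve_setting VALUE
# If VALUE has the form _env:NAME, print the value of the variable NAME;
# otherwise print VALUE unchanged. Illustration only.
resolve_setting() {
    case "$1" in
        _env:*)
            name=${1#_env:}
            eval "printf '%s\n' \"\${$name}\""
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

AWS_BUCKET=memorycraft-docker-registry
resolve_setting "_env:AWS_BUCKET"   # taken from the variable
resolve_setting "/images"           # literal values pass through
```

This is why the same image can be pointed at a different bucket or key pair just by changing the -e options on docker run, without editing config.yml again.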

Exit the container and commit it.

# exit;
# docker ps -a
CONTAINER ID        IMAGE                         COMMAND                CREATED             STATUS              PORTS                    NAMES
1e1e890b9647        registry:0.6.5                /bin/bash              6 minutes ago       Exit 0                                       grave_wozniak

# docker commit 1e1e890b9647 memorycraft/registry

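Start the committed S3-enabled registry image. As described above, pass the AWS keys, bucket name and so on as environment variables with the -e option.
The SETTINGS_FLAVOR environment variable corresponds to the prod section of config.yml, so it selects which section of the configuration file is used at startup.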

# docker run -p 5000:5000 -e SETTINGS_FLAVOR=prod -e AWS_KEY=XXXXXXX -e AWS_SECRET=YYYYYYYYY -e AWS_BUCKET=memorycraft-docker-registry -d memorycraft/registry
#
#
# docker ps -a
CONTAINER ID        IMAGE                         COMMAND                CREATED             STATUS              PORTS                    NAMES
1e1e890b9647        registry:0.6.5                /bin/bash              6 minutes ago       Exit 0                                       grave_wozniak
413aa68a3ad1        memorycraft/registry:latest   /docker-registry/run   9 hours ago         Up 9 hours          0.0.0.0:5000->5000/tcp   prickly_davinci

The S3-backed registry is now ready.

Pushing to the S3 registry

Now push to this registry from the client side. The flow is the same as in the previous post.

# docker ps -a
CONTAINER ID        IMAGE                       COMMAND                CREATED             STATUS              PORTS                                          NAMES
ee60c633c8f9        memorycraft/centos:latest   /usr/bin/supervisord   3 days ago          Up 3 days           0.0.0.0:49189->22/tcp, 0.0.0.0:49190->80/tcp   happy_brattain
c118bcc97b1e        539c0211cd76                /bin/bash              5 days ago          Exit 0

# docker commit ee60c633c8f9 176.34.16.242:5000/memorycraft/centos
b939188f4672b83d03e90ad12c4ad9e2ccdfa66d2f50fd44ae18ef314eee5c5b

# docker images
REPOSITORY                              TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
176.34.16.242:5000/memorycraft/centos   latest              b939188f4672        14 seconds ago      443.8 MB
memorycraft/centos                      latest              d0c94b943ba2        4 days ago          437.9 MB

# docker push 176.34.16.242:5000/memorycraft/centos
The push refers to a repository [176.34.16.242:5000/memorycraft/centos] (len: 1)
Sending image list
Pushing repository 176.34.16.242:5000/memorycraft/centos (1 tags)
539c0211cd76: Image successfully pushed
380423464fbc: Image successfully pushed
dc52da789c75: Image successfully pushed
b2ab60219415: Image successfully pushed
52b555115035: Image successfully pushed
a149f9038d0e: Image successfully pushed
3897f6889349: Image successfully pushed
bd1a450e0e46: Image successfully pushed
da6f1a424b7c: Image successfully pushed
4a8d2a1dab88: Image successfully pushed
af06476dc08c: Image successfully pushed
65ec465a844b: Image successfully pushed
318326461017: Image successfully pushed
fc6935aadec7: Image successfully pushed
9022a04f5b3f: Image successfully pushed
4787e46941f7: Image successfully pushed
30f9368972bb: Image successfully pushed
dc6de6feb9a9: Image successfully pushed
d0c94b943ba2: Image successfully pushed
b939188f4672: Image successfully pushed
Pushing tags for rev [b939188f4672] on {http://176.34.16.242:5000/v1/repositories/memorycraft/centos/tags/latest}

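The push seems to have succeeded.
Looking into the S3 bucket, you can see that the repository metadata, images and so on have been uploaded.

Deleting the registry

Now assume the registry server has broken or the instance has been lost, and set up the registry from scratch on a different server.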

# docker run -t -i registry /bin/bash
# cd /docker-registry/config
# mv config.yml config.yml.org
# cp config_s3.yml config.yml
# cat config.yml
# exit


# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED              STATUS              PORTS               NAMES
927bb1ccf9dc        registry:0.6.5      /bin/bash           About a minute ago   Exit 0                                  trusting_mccarthy
# docker commit 927bb1ccf9dc memorycraft/registry
# docker run -p 5000:5000 -e SETTINGS_FLAVOR=prod -e AWS_KEY=XXXXXXXXXXXXXXXX -e AWS_SECRET=YYYYYYYYYYYYYYYY -e AWS_BUCKET=memorycraft-docker-registry -d memorycraft/registry /docker-registry/run.sh

Pulling from S3

To be sure, I also removed the images and containers on the client side, then ran the image by specifying the S3 registry.

# docker run -t -i 176.34.16.242:5000/memorycraft/centos /bin/bash
Unable to find image '176.34.16.242:5000/memorycraft/centos' (tag: latest) locally
Pulling repository 176.34.16.242:5000/memorycraft/centos
b939188f4672: Download complete
da6f1a424b7c: Download complete
dc6de6feb9a9: Download complete
af06476dc08c: Download complete
b2ab60219415: Download complete
65ec465a844b: Download complete
4a8d2a1dab88: Download complete
3897f6889349: Download complete
a149f9038d0e: Download complete
539c0211cd76: Download complete
30f9368972bb: Download complete
52b555115035: Download complete
bd1a450e0e46: Download complete
318326461017: Download complete
4787e46941f7: Download complete
380423464fbc: Download complete
dc52da789c75: Download complete
d0c94b943ba2: Download complete
9022a04f5b3f: Download complete
fc6935aadec7: Download complete
bash-4.1#

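The image was fetched and the container started.
We now have a registry backed by the high durability and availability of S3.

That's all.

Monday, February 17, 2014

What's Docker? (Private registry)

The previous post covered Docker Index, the public registry, but you can also create a private registry. That lets you manage assets you do not want to leave the company.

Docker Index provides a repository for private registries; you pull it and run it as a container to get a private registry.

Let's try it.
localhost would do, but this time I prepare a dedicated registry server.
Install Docker on the registry server as described in the first post.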

Installing and starting the registry

Pull the repository named registry from Docker Index.

# docker pull registry

All that is left is to start it. The registry listens on port 5000 inside the container; this time I also use 5000 as the external port.

# docker run -d -p 5000:5000 registry

commit

On the container server used so far, commit a container as before, but name the image in the following form:

docker commit <container ID> <registry server IP>:<registry port>/<repository name>

For example:

# docker ps -a
CONTAINER ID        IMAGE                       COMMAND                CREATED             STATUS              PORTS                                          NAMES
7b033c1821a7        centos:6.4                  /bin/bash              38 minutes ago      Exit 0

# docker commit 7b033c1821a7 176.34.16.242:5000/memorycraft
a2a360e08ee87d8fd3c98f08701ce0e4d681164e50432ff032890108eded996c

The image is then saved with that tag, as shown below.

# docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
176.34.16.242:5000/memorycraft   latest              a2a360e08ee8        12 seconds ago      360.4 MB
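
The naming rule used for the commit above can be wrapped in a small helper so that the registry address is written in one place. This is a minimal sketch; registry_image_name is a hypothetical name, and the address and repository are the ones used in this post.

```shell
#!/bin/sh
# registry_image_name REGISTRY_HOST REGISTRY_PORT REPOSITORY
# Builds the image name that directs commit and push at a private registry.
registry_image_name() {
    printf '%s:%s/%s\n' "$1" "$2" "$3"
}

image=$(registry_image_name 176.34.16.242 5000 memorycraft)
echo "$image"
# The same name is then passed to both commands, for example:
#   docker commit <container ID> "$image"
#   docker push "$image"
```

Because the registry host and port are part of the image name, docker needs no separate option to know where to push.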

push

Now push it. push takes the same name as commit.

# docker push 176.34.16.242:5000/memorycraft
The push refers to a repository [176.34.16.242:5000/memorycraft] (len: 1)
Sending image list
Pushing repository 176.34.16.242:5000/memorycraft (1 tags)
539c0211cd76: Image successfully pushed
a2a360e08ee8: Image successfully pushed
Pushing tags for rev [a2a360e08ee8] on {http://176.34.16.242:5000/v1/repositories/memorycraft/tags/latest}

It appears to have been pushed to our own registry.

Verification

Next, delete the local image.

# docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
176.34.16.242:5000/memorycraft   latest              a2a360e08ee8        3 minutes ago       360.4 MB

# docker rmi a2a360e08ee8
Untagged: a2a360e08ee87d8fd3c98f08701ce0e4d681164e50432ff032890108eded996c
Deleted: a2a360e08ee87d8fd3c98f08701ce0e4d681164e50432ff032890108eded996c

# docker images
REPOSITORY           TAG                 IMAGE ID            CREATED             VIRTUAL SIZE

Then run it again from our own registry.

# docker run -t -i 176.34.16.242:5000/memorycraft /bin/bash
Unable to find image '176.34.16.242:5000/memorycraft' (tag: latest) locally
Pulling repository 176.34.16.242:5000/memorycraft
539c0211cd76: Download complete
a2a360e08ee8: Download complete
bash-4.1#
bash-4.1#

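It started without problems.

With your own registry, you can share container images you do not want to publish, limited to your company or system.

That's all.

Saturday, February 15, 2014

What's Docker? (Docker Index registry)

It has already come up in the earlier Docker posts:
Docker has the concept of a repository.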

A repository is a hosted collection of tagged images that together create the file system for a container. The repository’s name is a tag that indicates the provenance of the repository, i.e. who created it and where the original copy is located.

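Working with Repositories

Repository

A repository is a hosted collection of tagged images that make up the file system (AUFS) of a container. The repository name serves as a tag that indicates the repository's provenance, such as who created it and where the original copy is located.

Registry

Repositories live on a registry. By default the registry is Docker Index, a public registry hosted by Docker. It holds both top-level repositories provided by Docker itself and user repositories created under accounts. Docker Index is to Docker roughly what GitHub is to git.

Creating an account

To use Docker Index, create an account. First, open the Docker site.
Click the sign up link in the local navigation to display the sign-up page, then enter your details and register.

A confirmation email arrives; activate the account to complete registration.

Open Authorized Service on your account page and click "Go to application" for Docker Index.

The Docker Index page opens, so log in.

Docker Index shows the list of your repositories, which is empty at first.
We will upload a repository here.

Working with repositories

login

Leave the browser for a moment and log in to Docker Index from the command line.
You are asked for a username, password and email; enter the details of the account you just registered.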

# docker login
Username: memorycraft
Password:
Email: memorycraft@gmail.com
Login Succeeded

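The docker command line is now linked to the account.

search

The centos repository used in the first Docker post is an official repository.
Users start by building containers from official repositories or from public repositories of other users on the Docker Index registry.

To look for a repository on the registry: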

# docker search centos
NAME                                     DESCRIPTION                                     STARS     OFFICIAL   TRUSTED
tianon/centos                            CentOS 5 and 6, created using rinse instea...   6
centos                                                                                   27
tutum/centos                             CentOS Docker image with SSH access             5
hnakamur/centos                          CentOS 6.5 x86_64 base image                    1
zwxajh/centos                            centos 6 base system.                           1
kalefranz/centos                                                                         2                    [OK]
goyalankit/centos                        Bare centos repo                                1
blalor/centos                            Bare-bones base CentOS 6.5 image                0                    [OK]
backjlack/centos                         This repository contains the following ima...   0
.......

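Repository names and other details are listed.
(centos is official on Docker Index, but for some reason the OFFICIAL column is not marked on the CLI.)
TRUSTED marks a Trusted Repository, a special kind of repository hosted in conjunction with a GitHub repository. I will cover it another time.

pull

Once search has found the repository you want, bring it to your machine.

For example,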

# docker pull centos

pulls the centos repository from the Docker registry as a local image.
Also,

# docker run -t -i centos /bin/bash

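uses the local centos image if there is one; otherwise it pulls the image from the centos repository on the Docker registry and starts the container right away.

commit

As in the earlier posts, while working in a container you will sometimes want to commit your work.
A commit converts the container into an image and saves it. Name the image in the form "username/imagename". Using your Docker Index username makes things clearer when you register the image on Docker Index.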

# docker commit <container_id> <username>/<imagename>

push

push saves the image to a repository.
It is similar to git push. When the destination is the default, the image is registered on Docker Index.

# docker push <username>/<repo_name>

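Now look at Docker Index.

The repository list now shows the one repository that was pushed!

Using a registry is handy for recovering when a docker server breaks, or for migrating or adding environments.

That's all for this time.

Thursday, February 13, 2014

What's Docker? (Building with a Dockerfile)

Docker has something called a Dockerfile.
When you run docker build, it looks for a Dockerfile in the specified path, creates a new container, executes the steps written in the file, then commits, producing the image automatically.

For example,

$ docker build -t hoge/moge /path/to/context/

runs the steps in the Dockerfile in the /path/to/context/ directory and saves the result under the repository name hoge/moge, all automatically.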

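A Dockerfile is basically written in the form

INSTRUCTION arguments

Instructions

The following instructions are available.

FROM

Specifies the base image the container is built from.

FROM <image>

MAINTAINER

Sets the Author field of the generated image.

MAINTAINER <name>

RUN

A command executed during a build step.

# Executed as a command via /bin/sh -c
RUN <command>
# Specifies the executable directly
RUN ["executable", "param1", "param2"]

CMD

The default command when a container created from the image starts.

# Specifies the executable directly
CMD ["executable","param1","param2"]
# Defines default parameters for ENTRYPOINT
CMD ["param1","param2"]
# Executed through the shell
CMD command param1 param2

EXPOSE

Exposes the specified ports. The numbers are the ports as seen from inside the container.

EXPOSE <port> [<port>...]

ENV

Sets an environment variable.

ENV <key> <value>

ADD

Places a file from outside the container into the container.

ADD <src> <dest>

ENTRYPOINT

The command run when a container created from the image starts.
Unlike CMD, when arguments are given to docker run the command itself is not overwritten; the arguments are treated as its parameters.

# Specifies the executable directly
ENTRYPOINT ["executable", "param1", "param2"]
# Executed through the shell
ENTRYPOINT command param1 param2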
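
The difference between CMD and ENTRYPOINT is easier to see with concrete values. The function below is a simplified model of the exec form only, written for illustration; effective_command is a hypothetical name, and each of ENTRYPOINT, CMD and the docker run arguments is passed as a single string (empty meaning unset).

```shell
#!/bin/sh
# effective_command ENTRYPOINT CMD RUN_ARGS
# Prints the command line a container would start with (simplified model).
effective_command() {
    entrypoint=$1
    cmd=$2
    run_args=$3
    # Arguments given after the image name in docker run replace CMD.
    if [ -n "$run_args" ]; then
        cmd=$run_args
    fi
    if [ -n "$entrypoint" ]; then
        # ENTRYPOINT stays in place; CMD or the run arguments are appended.
        printf '%s\n' "$entrypoint${cmd:+ $cmd}"
    else
        printf '%s\n' "$cmd"
    fi
}

effective_command "" "/usr/bin/supervisord" ""
# prints: /usr/bin/supervisord
effective_command "" "/usr/bin/supervisord" "/bin/bash"
# prints: /bin/bash
effective_command "/usr/bin/supervisord" "-n" "-c /etc/supervisord.conf"
# prints: /usr/bin/supervisord -c /etc/supervisord.conf
```

In the second call the run argument replaces the whole command, while in the third the command set by ENTRYPOINT survives and only its parameters change.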

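VOLUME

The mount point used when mounting a volume from outside.

VOLUME ["/data"]

USER

The username or UID used when the container runs.

USER daemon

WORKDIR

Sets the working directory in which CMD is executed. It also applies to the RUN and ENTRYPOINT instructions that follow it.

WORKDIR /path/to/workdir

ONBUILD

Registers an instruction that is not run in the current build. It is triggered later, when this image is used as the base (FROM) of another build, which is useful for steps that cannot be run yet at this point.

ONBUILD [INSTRUCTION]

These instructions are combined to assemble the contents of a container.

Writing the Dockerfile

Rewriting the container from the previous post, which starts sshd and httpd under supervisord, as a Dockerfile with the instructions above gives the following.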

# Based on centos
FROM centos

# Author
MAINTAINER memorycraft

# Install the required packages with yum
RUN yum install sudo passwd openssh openssh-clients openssh-server httpd vim python-setuptools -y

# sshd settings
RUN sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config
RUN sed -ri 's/#UsePAM no/UsePAM no/g' /etc/ssh/sshd_config
# Generate host keys
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
# Create the user that logs in over ssh
RUN useradd memorycraft
RUN echo 'memorycraft:memorycraft_password' | chpasswd
RUN echo 'memorycraft ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers.d/memorycraft

# Top page for httpd
RUN echo 'moge' > /var/www/html/index.html

# Install supervisord
RUN easy_install supervisor
# Configure supervisord
RUN echo_supervisord_conf > /etc/supervisord.conf
RUN echo '[include]' >> /etc/supervisord.conf
RUN echo 'files = supervisord/conf/*.conf' >> /etc/supervisord.conf
RUN mkdir -p  /etc/supervisord/conf/
ADD templates/memorycraft/conf/supervisor.conf /etc/supervisord/conf/service.conf


# Expose ports
EXPOSE 22 80

# Run supervisord at startup
CMD ["/usr/bin/supervisord"]

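Running the build

Now build this Dockerfile to create the image. The key point is --rm.
By default a commit is made after every instruction, leaving many containers behind; with this option, all intermediate containers are removed after a successful build.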

# docker build --no-cache --rm -t memorycraft/centos .
# docker images
REPOSITORY           TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
memorycraft/centos   latest              d0c94b943ba2        2 minutes ago       437.9 MB
centos               6.4                 539c0211cd76        10 months ago       300.6 MB
centos               latest              539c0211cd76        10 months ago       300.6 MB

The image seems to have been created.
Next, start a container from it.

# docker run -d -p 22 -p 80 memorycraft/centos /usr/bin/supervisord
# docker ps -a
CONTAINER ID        IMAGE                       COMMAND                CREATED             STATUS              PORTS                                          NAMES
1d4a403c44c5        memorycraft/centos:latest   /usr/bin/supervisord   44 seconds ago      Up 43 seconds       0.0.0.0:49185->22/tcp, 0.0.0.0:49186->80/tcp   goofy_darwin
c118bcc97b1e        centos:6.4                  /bin/bash              18 hours ago        Exit 0

Verification

Access this container as in the previous post. The ssh username and password are the ones set in the Dockerfile.

# ssh memorycraft@127.0.0.1 -p 49185
The authenticity of host '[127.0.0.1]:49185 ([127.0.0.1]:49185)' can't be established.
RSA key fingerprint is ea:46:dd:dd:07:64:89:6e:b6:e4:38:73:41:82:32:b7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[127.0.0.1]:49185' (RSA) to the list of known hosts.
memorycraft@127.0.0.1's password:
[memorycraft@1d4a403c44c5 ~]$

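Connected!

Also check in the browser.

The page is displayed!
Infrastructure can now be managed by coding a Dockerfile.

That's all for this time.

What's Docker? (Starting multiple services with Supervisor)

Another Docker post.
In the previous post I ran a container as an SSH server, but in practice containers usually run several services.

Docker seems to accept only one CMD at startup, so the previous approach can start only one service. In that case Supervisor is a good fit.

For example, for a container running sshd and httpd, supervisord inside the container manages sshd and httpd, and docker starts the services by way of supervisord.

Let's try it.
The container uses the same centos image as last time.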

Starting the container

Start a container and attach to its console.

# docker run -t -i centos /bin/bash

Installing ssh and httpd

Inside the container, first install the required programs: sshd, httpd, and easy_install for installing supervisor.

-bash-4.1# yum install passwd openssh openssh-clients openssh-server httpd vim python-setuptools -y

Configure sshd.

-bash-4.1# vim /etc/ssh/sshd_config

#UsePAM no
UsePAM yes
↓
UsePAM no
#UsePAM yes

Set the root user's password.

-bash-4.1# passwd

Prepare an index page for httpd.

-bash-4.1# echo moge > /var/www/html/index.html

Installing Supervisor

Install supervisor.

-bash-4.1# easy_install supervisor

Prepare a startup script for supervisor.

-bash-4.1# vim /etc/init.d/supervisord

#!/bin/sh
#
# /etc/rc.d/init.d/supervisord
#
# Supervisor is a client/server system that
# allows its users to monitor and control a
# number of processes on UNIX-like operating
# systems.
#
# chkconfig: - 64 36
# description: Supervisor Server
# processname: supervisord

# Source init functions
. /etc/init.d/functions

RETVAL=0
prog="supervisord"
pidfile="/tmp/supervisord.pid"
lockfile="/var/lock/subsys/supervisord"

start()
{
   echo -n $"Starting $prog: "
   daemon --pidfile $pidfile supervisord
   RETVAL=$?
   echo
   [ $RETVAL -eq 0 ] && touch ${lockfile}
}
stop()
{
   echo -n $"Shutting down $prog: "
   killproc -p ${pidfile} /usr/bin/supervisord
   RETVAL=$?
   echo
   if [ $RETVAL -eq 0 ] ; then
      rm -f ${lockfile} ${pidfile}
   fi
}
case "$1" in
  start)
    start
  ;;
  stop)
    stop
  ;;
  status)
    status $prog
  ;;
  restart)
    stop
    start
  ;;
  *)
    echo "Usage: $0 {start|stop|restart|status}"
  ;;
esac

# chmod 755 /etc/init.d/supervisord

Configuring Supervisor

Prepare the supervisor configuration file.

-bash-4.1#  echo_supervisord_conf > /etc/supervisord.conf
-bash-4.1#  echo "[include]" >> /etc/supervisord.conf
-bash-4.1#  echo "files = supervisord/conf/*.conf" >> /etc/supervisord.conf
-bash-4.1#  mkdir -p  /etc/supervisord/conf/

Write the settings. The key point is nodaemon=true in the [supervisord] section.
With it, supervisord runs in the foreground, so the container's main process does not exit.

-bash-4.1#  vim /etc/supervisord/conf/service.conf
[supervisord]
nodaemon=true

[program:sshd]
command=/usr/sbin/sshd -D
autostart=true
autorestart=true

[program:httpd]
command=/usr/sbin/httpd -c "ErrorLog /dev/stdout" -DFOREGROUND
redirect_stderr=true

Now leave the container.

-bash-4.1# exit

Creating the image

On the host, commit the container.

# docker ps -a
CONTAINER ID        IMAGE                       COMMAND                CREATED             STATUS              PORTS                                          NAMES
c118bcc97b1e        centos:6.4                  /bin/bash              3 hours ago         Exit 0

# docker commit c118bcc97b1e memorycraft/centos

Starting the container

Start a new container from the committed image.
Specify /usr/bin/supervisord as the command and publish the two ports 22 and 80.

# docker run -d -p 22 -p 80 memorycraft/centos /usr/bin/supervisord

Verification

Access the container over ssh and http.
docker ps -a shows the port mappings.

# docker ps -a
CONTAINER ID        IMAGE                       COMMAND                CREATED             STATUS              PORTS                                          NAMES
337f5660c4d0        memorycraft/centos:latest   /usr/bin/supervisord   47 minutes ago      Up 47 minutes       0.0.0.0:49173->22/tcp, 0.0.0.0:49174->80/tcp   stoic_bell
c118bcc97b1e        centos:6.4                  /bin/bash              3 hours ago         Exit 0                                                             sad_torvalds

Connect over ssh using the mapped port.

# ssh root@127.0.0.1 -p 49173
The authenticity of host '[127.0.0.1]:49173 ([127.0.0.1]:49173)' can't be established.
RSA key fingerprint is 59:de:98:f5:21:15:bc:40:c1:29:8d:76:84:eb:59:05.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[127.0.0.1]:49173' (RSA) to the list of known hosts.
root@127.0.0.1's password:

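Connected!

Next, http.
Open the mapped port in a browser.

It works!

This is how to build a container that runs multiple services.
That's all for this time.

Tuesday, February 11, 2014

What's Node.js? (gm: image compositing with GraphicsMagick)

This time I use node.js to composite images.
ImageMagick is commonly used for this kind of image generation, but GraphicsMagick, a fork of ImageMagick, performs better, so I use that.

From node.js I use a module called gm, which can drive GraphicsMagick/ImageMagick.

I haven't used node.js for a while, so I start from installing node.
As with other scripting languages, there seem to be many competing environment managers such as nvm and the env family.
nodebrew looked convenient, so I used it.

Installing GraphicsMagick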

Install with yum.

# yum install gcc-c++ GraphicsMagick -y

Installing node.js and modules with nodebrew

nodebrew appears to create files only under the location where it is set up, so it does not pollute the environment and can be installed without root.
Install it and add it to PATH, and that's it.

# cat /etc/profile.d/nodebrew.sh
#!/bin/bash

export NODE_PATH=$HOME/.nodebrew/current/node_modules
export PATH=$HOME/.nodebrew/current/bin:$PATH

$ source /etc/profile
$ cd ~/
$ curl https://raw.github.com/hokaccha/nodebrew/master/nodebrew | perl - setup

$ nodebrew install latest
$ nodebrew use latest

$ node -v
v0.11.11

$ npm install gm
$ npm install argv

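Implementation

image.js

The main image compositing process. It downloads the image at the URL given as an argument and composites it.

gmcomposite.js

A module that wraps the part that runs the compositing in a child process.

Placing the image parts

Facebook profile pictures look rather plain, so this time I make one cuter. Create a directory for compositing assets and directories for the output.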

$ mkdir -p assets rslt/convert rslt/composite rslt/download
$ tree ~/
/home/memorycraft/
|-- assets
|   `-- frame.png // compositing asset (used as both mask and base)
|-- gmcomposite.js 
|-- image.js
|-- node_modules
|   |-- argv
|   `-- gm
|-- rslt
|   |-- composite
|   |-- convert
|   `-- download
`-- tmp

Also link the output directory from the httpd document root.

# chmod 755 /home/memorycraft/
# cd /var/www/html
# ln -s /home/memorycraft/rslt rslt

Place the mask asset in the assets directory.

frame.png

Running

Now run it. Pass the facebook image URL and a facebook ID (used only for the image name, so anything will do) as option arguments.

$ node image.js -u http://fbcdn-sphotos-b-a.akamaihd.net/hphotos-ak-prn1/t1/1525482_10202990719234672_615625235_n.jpg -f memocra
create complete !

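The output seems to have been written.

Verification

Check the generated image in the browser.

... It is not quite what I imagined, but the compositing worked, so I'll call it good.
That's all for this time.

Monday, February 10, 2014

What's TreasureData?

Treasure Data is a log analysis platform service that handles collecting, storing, managing, processing and visualizing data; it collects data with fluentd and analyzes it with hadoop.
I wrote about this in an earlier post: think of the part usually handled by elasticsearch + kibana or splunk, offered as a service.

People inside Treasure Data have written about it in great detail on their blogs.

Let's try it out.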

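Sign-up and login

Go to http://www.treasuredata.com/jp/products/try-it-now.php
and click the sign-up link.

The sign-up page appears; enter the required information and register.

After registering, a confirmation email is sent; click the confirmation link to complete registration.
To log in later, use the login page.

Sending logs

On the server whose logs are collected, configure apache log data to be sent to Treasure Data.
First, on that server, run the td account command with the email address and password you registered, to link the account.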

# td account
Enter your Treasure Data credentials.
Email: miura@cloudpack.jp
Password (typing will be hidden):
Authenticated successfully.
Use 'td db:create <db_name>' to create a database.

Next, run td apikey:show to get the API key used when sending to Treasure Data.
(The API key is masked here.)

# td apikey:show
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Next, in /etc/td-agent/td-agent.conf, set the source to tail the apache access log and the match to tdlog with the API key above. tdlog is the plugin that sends logs to Treasure Data.
With auto_create_table set, the table for the logs is created automatically on the Treasure Data side.

  <source>
    type tail
    format apache
    path /var/log/httpd/access_log
    pos_file /tmp/access.log.pos
    tag td.apache.access
  </source>

  <match td.*.*>
    type tdlog
    apikey XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    auto_create_table
    buffer_type file
    buffer_path /var/log/td-agent/buffer/td
    use_ssl true
  </match>

Once configured, start td-agent.

/etc/init.d/td-agent start

That completes the setup on the log-collecting server.

The td tables command lists the tables created on the Treasure Data side.

# td tables;
+-----------+------------+------+-------+--------+---------------------------+---------------------------+----------------------------------------------------------------------------------------------------------+
| Database  | Table      | Type | Count | Size   | Last import               | Last log timestamp        | Schema                                                                                                   |
+-----------+------------+------+-------+--------+---------------------------+---------------------------+----------------------------------------------------------------------------------------------------------+
| apache    | access     | log  | 26    | 0.0 GB | 2014-02-10 04:01:49 +0900 | 2014-02-10 04:00:17 +0900 | host:string, path:string, method:string, referer:string, code:long, agent:string, user:string, size:long |
| sample_db | www_access | log  | 5,000 | 0.0 GB | 2014-01-30 16:43:07 +0900 | 2013-09-07 10:13:45 +0900 | host:string, path:string, method:string, referer:string, code:long, agent:string, user:string, size:long |
+-----------+------------+------+-------+--------+---------------------------+---------------------------+----------------------------------------------------------------------------------------------------------+
2 rows in set

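A Treasure Data account comes with a database named sample_db by default; once logs start arriving, the access table in the apache database has also been created.
The database and table names are derived from the tag in source: td.apache.access becomes database apache, table access.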
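
The tag-to-table rule can be written down as a few lines of shell, which may help when choosing tags for other log types. This is only a sketch of the naming rule as observed above; td_target is a hypothetical name, and it assumes tags of the form td.<database>.<table>.

```shell
#!/bin/sh
# td_target TAG
# Splits a fluentd tag of the form td.<database>.<table> and prints
# "<database> <table>".
td_target() {
    rest=${1#td.}
    printf '%s %s\n' "${rest%%.*}" "${rest#*.}"
}

td_target td.apache.access
# prints: apache access
```

Changing the tag in the source section is therefore enough to send another log to a different database or table.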

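Aggregating in Treasure Data

The Databases page of the Treasure Data console also shows that the apache database has been added.
Drilling down shows the registered log data in structured form.

Set up an aggregation with New Query in the left pane.
Select apache as the database and, this time, Hive as the Query type.
Write the Hive query in Query, leave the rest as is, and click "Submit" to start the Job. Jobs can also be scheduled.

Copy result to lets you choose a destination for the results, such as MySQL or S3.

When the Job finishes, the result appears in the Job list.
No destination was specified this time, so the result is shown directly on this result page.

That was a quick run through collecting logs with td-agent, and checking and aggregating them in the Treasure Data console.

That's all for this time.

Thursday, February 6, 2014

What's Kibana? (kibana3 + elasticsearch + fluentd)

I wrote about Kibana quite a while ago; since then Kibana3 has come out and is well regarded, so here is a second introduction.

Index server side

Preparation

On the index server, open port 9200 for fluentd to send to, in addition to ports 80 and 22.
The index will eventually grow large, so put it on a large-capacity volume.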

# yum install xfsprogs httpd java-1.7.0-openjdk -y
# mkfs.xfs /dev/xvdf
# mount -t xfs /dev/xvdf /mnt/ebs/0

Installing Kibana3

As of v3, Kibana is HTML rather than ruby. Just put it under the DocumentRoot and start httpd.

# cd /var/www/
# curl -OL http://download.elasticsearch.org/kibana/kibana/kibana-latest.zip
# unzip kibana-latest.zip
# mv html html.org
# mv kibana-latest html
# /etc/init.d/httpd start

Installing ElasticSearch

# cd /mnt/ebs/0/
# curl -OL https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.11.tar.gz
# tar xzvf elasticsearch-0.90.11.tar.gz
# cd elasticsearch-0.90.11/
# ./bin/elasticsearch start

The index server is now ready.

Log-sending server side

To send apache logs, the fluentd configuration uses the tail input plugin and the elasticsearch output plugin.

# yum install td-agent -y
# vim /etc/td-agent/td-agent.conf

<source>
  type tail
  format apache2
  path /var/log/httpd/access_log
  pos_file /tmp/access.log.pos
  tag server1.apache.access
</source>

<match server1.apache.access>
  type_name apache
  type elasticsearch
  include_tag_key true
  tag_key @log_name
  host XXX.XXX.XXX.XXX
  port 9200
  logstash_format true
  flush_interval 10s
</match>

# /etc/init.d/td-agent start

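That completes the log sending setup.

Verification

With everything configured, open the kibana page.
This is the default top page.
Click the [Logstash Dashboard] link on the right to go to the dashboard.

The dashboard shows the logs sent from fluentd in the template dashboard.

Adding queries

Any number of query fields can be added with the "+" button on the field.
The query syntax appears to be based on lucene syntax.

Adding panels

The dashboard is laid out on a grid; basically you add panels to rows (ROW).
A panel uses one or more queries.

Add a panel with the "Add Panel" button on a ROW.

In the add-panel dialog, set the panel type, the queries to use, and other panel-specific parameters.

Combine queries and panels to build a dashboard that suits your purpose.

Once built, save the dashboard under a name so that it persists across reloads.

Compared with before, it can display a much wider range of data.
More details another time.

That's all.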

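Wednesday, February 5, 2014

What's EMR? (Blazing-fast CloudFront log aggregation with Impala)

EMR now supports Impala. Impala is an open-source query engine provided by Cloudera and is said to be far faster than Hive.

As an example, I convert the timestamps in CloudFront logs to JST and count accesses per hour.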

ログバケットの確認

まずCloudFrontのログが以下のS3にたまっているとします。
s3://memorycraft-impala-input/cf/logs

EMRクラスタの起動

次に、EMRクラスタを起動します。
EMRのダッシュボードで「Create Cluster」をクリックし、新規クラスタ作成画面を表示します。

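Cluster Configuration

Enter a suitable name in Cluster name. We will not output EMR logs this time, so uncheck Logging. If you want to name the launched instances, set the instance name as the Name tag under Tags.

Software Configuration

Set the Hadoop version and the list of applications.
The Applications to be installed list contains Hive and Pig by default.
If you select 3.0.2, the latest AMI version at the time of writing, Impala becomes selectable in the Additional applications dropdown below.
Select Impala and click "Configure and add" to add it.

Hardware Configuration

Here you set the VPC, zone, and number of instances.
In this example we use EC2-Classic, zone A, and the default number of instances.

Security and Access

In EC2 key pair, set the name of the key pair used to connect to the hadoop instances.
Note that if this is left undefined, you will not be able to connect to the instances.

Bootstrap Actions & Steps

Here you define post-launch behavior, launch parameters, and so on.
Leaving it unset is fine this time.
Finally, click the "Create cluster" button to create the cluster.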

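Connecting to the EMR instance

Once the launch completes, the master instance's public DNS name is shown under Master public DNS on the cluster details screen.

Connect to that public DNS name over SSH as the hadoop user, using the key selected earlier.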

$ ssh -i ~/.ssh/keys/memorycraft.pem hadoop@ec2-54-199-41-72.ap-northeast-1.compute.amazonaws.com
The authenticity of host 'ec2-54-199-41-72.ap-northeast-1.compute.amazonaws.com (54.199.41.72)' can't be established.
RSA key fingerprint is 0f:ac:d5:8d:50:2b:7c:92:6a:ad:74:5e:f3:d1:52:05.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-54-199-41-72.ap-northeast-1.compute.amazonaws.com,54.199.41.72' (RSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.09-release-notes/
13 package(s) needed for security, out of 46 available
Run "sudo yum update" to apply all updates.
--------------------------------------------------------------------------------

Welcome to Amazon Elastic MapReduce running Hadoop and Amazon Linux.

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands:

  ResourceManager    lynx http://localhost:9026/
  NameNode           lynx http://localhost:9101/

--------------------------------------------------------------------------------
$

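The connection is established.

Copying the S3 data

At present, specifying an S3 path directly for an external table in Impala resulted in an error. I could not find a way around it, so I copy the data to /input/ on HDFS with S3DistCp.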

$ vim ./import.sh

#!/bin/bash

. /home/hadoop/impala/conf/impala.conf

hadoop jar /home/hadoop/share/hadoop/common/lib/EmrS3DistCp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://memorycraft-impala-input/cf/logs/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/input/ --outputCodec 'none'

$ chmod 755 ./import.sh
$ ./import.sh
14/02/02 20:12:41 INFO s3distcp.S3DistCp: Running with args: -Dmapreduce.job.reduces=30 --src s3://memorycraft-impala-input/cf/logs/ --dest hdfs://172.31.0.95:9000/input/ --outputCodec none
14/02/02 20:12:41 INFO s3distcp.S3DistCp: S3DistCp args: --src s3://memorycraft-impala-input/cf/logs/ --dest hdfs://172.31.0.95:9000/input/ --outputCodec none
14/02/02 20:12:45 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/8f7379de-1ebf-4f31-be50-bd17aa54f2d5/output'
14/02/02 20:12:46 INFO s3distcp.S3DistCp: GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: ap-northeast-1a
14/02/02 20:12:46 INFO s3distcp.S3DistCp: Created AmazonS3Client with conf KeyId XXXXXXXXXXXXXX
~ (output omitted) ~
  Combine output records=0
  WRONG_LENGTH=0
  WRONG_MAP=0
  WRONG_REDUCE=0
 File Input Format Counters
  Bytes Read=3406
 File Output Format Counters
  Bytes Written=0
14/02/02 20:13:37 INFO s3distcp.S3DistCp: Try to recursively delete hdfs:/tmp/8f7379de-1ebf-4f31-be50-bd17aa54f2d5/tempspace

The copy has finished. Listing the HDFS directory shows that the files have been copied.

$ hadoop fs -ls /input/
Found 20 items
-rw-r--r--   1 hadoop supergroup       1728 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.21fcnUK5
-rw-r--r--   1 hadoop supergroup       1726 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.9jiEU8sU
-rw-r--r--   1 hadoop supergroup       1477 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.9pD0UZAy
-rw-r--r--   1 hadoop supergroup       4996 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.CECpH1q4
-rw-r--r--   1 hadoop supergroup       3636 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.DmeAD298
-rw-r--r--   1 hadoop supergroup        979 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.GhmYMPjI
-rw-r--r--   1 hadoop supergroup       1915 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.IvAE5n9h
-rw-r--r--   1 hadoop supergroup       1853 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.K3Crm40n
-rw-r--r--   1 hadoop supergroup       6120 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.KILmQ81g
-rw-r--r--   1 hadoop supergroup        969 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.KdRXIw4s
-rw-r--r--   1 hadoop supergroup       1731 2014-02-02 20:13 /input/E2KNLXPGJLJRRQ.2014-01-30-07.MIAnZGc4
.....

Running Impala

Impala is used through a dedicated prompt launched with the impala-shell command. It is similar to Hive's hive command.

$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to ip-10-146-59-193.ap-northeast-1.compute.internal:21000
Server version: impalad version 1.2.1 RELEASE (build 8c1da7709727f3545974009a4bb677a0004024ec)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.2.1 (8c1da77) built on Sun Dec  1 20:57:24 PST 2013)
[ip-10-146-59-193.ap-northeast-1.compute.internal:21000] >

First, create the input table. For the file location, specify /input/, which was the S3DistCp copy destination.

[ip-10-146-59-193.ap-northeast-1.compute.internal:21000] >
CREATE EXTERNAL TABLE IF NOT EXISTS input (
  cf_date STRING,
  cf_time STRING,
  x_edge_location STRING,
  sc_bytes INT,
  c_ip STRING,
  cs_method STRING,
  cs_host STRING,
  cs_uri_stem STRING,
  sc_status STRING,
  cs_referrer STRING, 
  cs_user_agent STRING,
  cs_uri_query STRING,
  cs_cookie STRING,
  x_edge_result_type STRING,
  x_edge_request_id STRING,
  x_host_header STRING,
  cs_protocol STRING,
  cs_bytes INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/input/';

Now let's run the query.

[ip-10-146-59-193.ap-northeast-1.compute.internal:21000] >
SELECT
  w.jsttime,
  count(mydata)
FROM
  (
    SELECT
      SUBSTR(
        CAST(
          FROM_UTC_TIMESTAMP(
            FROM_UNIXTIME(
              UNIX_TIMESTAMP(CONCAT(cf_date, " ", cf_time), 'yyyy-MM-dd HH:mm:ss')
            ), 'JST'
          ) as STRING
        ), 1, 13
      ) AS jsttime,
      'AAA' AS mydata 
    FROM
      input
    WHERE
      cf_date NOT LIKE '#%'
  ) w
GROUP BY
  w.jsttime
;
Query: select w.jsttime, count(mydata) from (select substr(cast(from_utc_timestamp(from_unixtime(unix_timestamp(concat(cf_date, " ", cf_time), 'yyyy-MM-dd HH:mm:ss')), 'JST') as STRING), 1, 13) as jsttime, 'AAA' as mydata  from input where cf_date not like '#%') w group by w.jsttime
+---------------+---------------+
| jsttime       | count(mydata) |
+---------------+---------------+
| 2014-01-30 16 | 95            |
| 2014-01-30 17 | 33            |
+---------------+---------------+
Returned 2 row(s) in 0.88s

It took less than one second.
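To make the logic of the query easier to follow, here is a minimal Python sketch of the same idea: skip the `#` header lines (the counterpart of `cf_date NOT LIKE '#%'`), join the date and time columns, shift from UTC to JST, truncate to the hour, and count. It is only an illustration and not how Impala executes the query. The function name `count_by_jst_hour` and the sample rows are made up, and the sample rows are cut down to four columns because only the first two are read.

```python
from collections import Counter
from datetime import datetime, timedelta

JST_OFFSET = timedelta(hours=9)  # JST is UTC+9 and has no daylight saving time


def count_by_jst_hour(lines):
    """Count tab-separated CloudFront log records per JST hour ("YYYY-MM-DD HH")."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header lines such as "#Version: 1.0".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        cf_date, cf_time = fields[0], fields[1]
        utc = datetime.strptime(f"{cf_date} {cf_time}", "%Y-%m-%d %H:%M:%S")
        jst = utc + JST_OFFSET
        # Same truncation as SUBSTR(..., 1, 13) in the query.
        counts[jst.strftime("%Y-%m-%d %H")] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the real logs.
    sample = [
        "#Version: 1.0",
        "2014-01-30\t07:59:59\tNRT12\t1024",
        "2014-01-30\t08:00:00\tNRT12\t2048",
        "2014-01-30\t08:30:00\tNRT12\t512",
    ]
    for hour, count in sorted(count_by_jst_hour(sample).items()):
        print(hour, count)
```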

Speed comparison with Hive

Let's run the same aggregation in Hive.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1391362544864_0002, Tracking URL = http://10.132.128.116:9046/proxy/application_1391362544864_0002/
Kill Command = /home/hadoop/bin/hadoop job  -kill job_1391362544864_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-02-02 17:46:18,013 Stage-1 map = 0%,  reduce = 0%
2014-02-02 17:46:29,536 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:30,585 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:31,646 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:32,707 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:33,773 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:34,822 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:35,872 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:36,925 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.92 sec
2014-02-02 17:46:37,975 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.85 sec
2014-02-02 17:46:39,034 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.85 sec
2014-02-02 17:46:40,083 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.85 sec
MapReduce Total cumulative CPU time: 6 seconds 850 msec
Ended Job = job_1391362544864_0002
Counters:
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 6.85 sec   HDFS Read: 58636 HDFS Write: 34 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 850 msec
OK
2014-01-30 16 95
2014-01-30 17 33
Time taken: 40.334 seconds, Fetched: 2 row(s)

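It took as long as 40 seconds (40.334 s, against 0.88 s with Impala). This shows just how fast Impala is.

Trying it with 800 million records

Using exactly the same query as above, I ran a CloudFront log analysis of 800 million records on Impala.

Data copy (S3 → HDFS)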

$ . /home/hadoop/impala/conf/impala.conf
$
$ hadoop jar /home/hadoop/share/hadoop/common/lib/EmrS3DistCp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://cloudfront-big-log/logs/cf/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/input/ --outputCodec 'none'
14/02/03 05:21:10 INFO s3distcp.S3DistCp: Running with args: -Dmapreduce.job.reduces=30 --src s3://cloudfront-big-log/logs/cf/ --dest hdfs://172.31.0.95:9000/input/ --outputCodec none
14/02/03 05:21:10 INFO s3distcp.S3DistCp: S3DistCp args: --src s3://cloudfront-big-log/logs/cf/ --dest hdfs://172.31.0.95:9000/input/ --outputCodec none
14/02/03 05:21:14 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/9dc925c6-0f4e-4043-a19a-f21930f89b6b/output'
14/02/03 05:21:15 INFO s3distcp.S3DistCp: GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: ap-northeast-1a
14/02/03 05:21:15 INFO s3distcp.S3DistCp: Created AmazonS3Client with conf KeyId XXXXXXXXXXXXXXXXXXXXXXXX
14/02/03 05:21:16 INFO s3distcp.S3DistCp: Skipping key 'logs/cf/' because it ends with '/'
14/02/03 05:21:16 INFO s3distcp.FileInfoListing: Opening new file: hdfs:/tmp/9dc925c6-0f4e-4043-a19a-f21930f89b6b/files/1
14/02/03 05:21:57 INFO s3distcp.S3DistCp: Created 1 files to copy 211761 files
14/02/03 05:21:57 INFO s3distcp.S3DistCp: Reducer number: 29
~ (output omitted) ~
  CPU time spent (ms)=16164160
  Physical memory (bytes) snapshot=18180935680
  Virtual memory (bytes) snapshot=53996851200
  Total committed heap usage (bytes)=14206107648
 Shuffle Errors
  BAD_ID=0
  CONNECTION=0
  IO_ERROR=0
  WRONG_LENGTH=0
  WRONG_MAP=0
  WRONG_REDUCE=0
 File Input Format Counters
  Bytes Read=39560891
 File Output Format Counters
  Bytes Written=0
14/02/03 06:30:50 INFO s3distcp.S3DistCp: Try to recursively delete hdfs:/tmp/9dc925c6-0f4e-4043-a19a-f21930f89b6b/tempspace

Creating the input table (input)

$ impala-shell
>
> CREATE EXTERNAL TABLE IF NOT EXISTS input (
>   cf_date STRING,
>   cf_time STRING,
>   x_edge_location STRING,
>   sc_bytes INT,
>   c_ip STRING,
>   cs_method STRING,
>   cs_host STRING,
>   cs_uri_stem STRING,
>   sc_status STRING,
>   cs_referrer STRING,
>   cs_user_agent STRING,
>   cs_uri_query STRING,
>   cs_cookie STRING,
>   x_edge_result_type STRING,
>   x_edge_request_id STRING,
>   x_host_header STRING,
>   cs_protocol STRING,
>   cs_bytes INT
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION '/input/';
Query: create EXTERNAL TABLE IF NOT EXISTS input ( cf_date STRING, cf_time STRING, x_edge_location STRING, sc_bytes INT, c_ip STRING, cs_method STRING, cs_host STRING, cs_uri_stem STRING, sc_status STRING, cs_referrer STRING, cs_user_agent STRING, cs_uri_query STRING, cs_cookie STRING, x_edge_result_type STRING, x_edge_request_id STRING, x_host_header STRING, cs_protocol STRING, cs_bytes INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/input/'

Returned 0 row(s) in 266.06s

Input record count

> select count(*) from input;
Query: select count(*) from input

+-----------+
| count(*)  |
+-----------+
| 820545673 |
+-----------+

Returned 1 row(s) in 642.03s

Creating the output table (output)

> CREATE EXTERNAL TABLE IF NOT EXISTS output_pv (
>   dth STRING,
>   cnt BIGINT
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION '/output_pv/';
Query: create EXTERNAL TABLE IF NOT EXISTS output_pv ( dth STRING, cnt BIGINT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/output_pv/'

Returned 0 row(s) in 2.23s

Analysis (input → output)

> insert into output
> select w.jsttime, count(mydata) from
> (select substr(cast(from_utc_timestamp(from_unixtime(unix_timestamp(concat(cf_date, " ", cf_time), 'yyyy-MM-dd HH:mm:ss')), 'JST') as STRING), 1, 13) as jsttime, 'AAA' as mydata  from input where cf_date not like '#%') w
> group by w.jsttime
> ;
Query: insert into output select w.jsttime, count(mydata) from (select substr(cast(from_utc_timestamp(from_unixtime(unix_timestamp(concat(cf_date, " ", cf_time), 'yyyy-MM-dd HH:mm:ss')), 'JST') as STRING), 1, 13) as jsttime, 'AAA' as mydata  from input where cf_date not like '#%') w group by w.jsttime

Inserted 759 rows in 14940.46s

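Results

Step                         Time
Data copy (S3 → HDFS)        about 1 hour
Table creation (input)       about 4 minutes (266.06 s)
Table creation (output)      about 2 seconds (2.23 s)
Analysis (input → output)    about 4 hours (14940.46 s)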
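For a rough sense of scale, the timings can be turned into throughput figures. The snippet below is just back-of-the-envelope arithmetic on the record count and the elapsed times printed by impala-shell above. The helper name `records_per_second` is made up for this example.

```python
def records_per_second(records: int, seconds: float) -> float:
    """Average number of records processed per second."""
    return records / seconds


RECORDS = 820_545_673  # count(*) reported by Impala above

# Elapsed times in seconds, as printed by impala-shell above.
TIMINGS = {
    "count(*)": 642.03,
    "analysis (input -> output)": 14940.46,
}

if __name__ == "__main__":
    for step, seconds in TIMINGS.items():
        rate = records_per_second(RECORDS, seconds)
        print(f"{step}: {seconds / 3600:.2f} h, {rate:,.0f} records/s")
```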

That is much faster than I had expected.

Incidentally, when I narrowed the query down to specific URIs, it finished in about 15 minutes.

> insert into output_pv
> select w.jsttime, count(mydata) from
> (select substr(cast(from_utc_timestamp(from_unixtime(unix_timestamp(concat(cf_date, " ", cf_time), 'yyyy-MM-dd HH:mm:ss')), 'JST') as STRING), 1, 13) as jsttime, 'AAA' as mydata  from input where cf_date not like '#%' and (cs_uri_stem = '/' or cs_uri_stem = '/index.html')) w
> group by w.jsttime
> ;
Query: insert into output_pv select w.jsttime, count(mydata) from (select substr(cast(from_utc_timestamp(from_unixtime(unix_timestamp(concat(cf_date, " ", cf_time), 'yyyy-MM-dd HH:mm:ss')), 'JST') as STRING), 1, 13) as jsttime, 'AAA' as mydata  from input where cf_date not like '#%' and (cs_uri_stem = '/' or cs_uri_stem = '/index.html')) w group by w.jsttime

Inserted 755 rows in 852.34s

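I could not try 800 million records with Hive this time, but judging from my experience with Hive so far, Impala is extraordinarily fast.
From now on, Impala looks like the only choice for running queries on EMR.

That's all.

Tuesday, February 4, 2014

What is Docker?

It's rather late in the day once again, but I tried out Docker.
This slide deck explains Docker about as clearly as it could possibly be explained.

This time, I will launch CentOS inside Docker and get as far as connecting to it over SSH.

Installing and starting Docker

The regular docker-io package is a little old, so install it from the epel-testing repository.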

# yum install --enablerepo=epel-testing docker-io-0.7.6-2.el6

Look for a CentOS base image with the docker search command.

# docker search centos
NAME                                     DESCRIPTION                                     STARS   
tutum/centos                             CentOS Docker image with SSH access             5
tianon/centos                            CentOS 5 and 6, created using rinse instea...   6
centos                                                                                   23
.......

The one named centos appears to be the standard one, so use it to start a container.

# docker run -t -i centos /bin/bash

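You are then automatically attached to a bash console inside the started container.
From here on, we are in the guest OS's world.
Set the root password so that we can log in over ssh later.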

-bash-4.1# passwd

Next, install openssh inside the container.

-bash-4.1# yum install openssh openssh-clients openssh-server vim -y

Then, in sshd_config, disable PAM. (This applies to CentOS.)

-bash-4.1# vim /etc/ssh/sshd_config

#UsePAM no
UsePAM yes
↓
UsePAM no
#UsePAM yes
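If you would rather script this edit than do it by hand in vim, the following is a minimal Python sketch of the same before/after transformation. It works on the file contents as a string and does not touch any file itself. The helper name `disable_pam` is hypothetical, and it assumes the `UsePAM` directives sit on their own lines as shown above.

```python
import re


def disable_pam(config_text: str) -> str:
    """Return sshd_config text with an active "UsePAM no" and no active "UsePAM yes"."""
    out = []
    has_no = False
    for line in config_text.splitlines():
        if re.match(r"^\s*UsePAM\s+yes\b", line):
            # Comment out the active "UsePAM yes".
            out.append("#" + line.lstrip())
        elif re.match(r"^\s*#\s*UsePAM\s+no\b", line) and not has_no:
            # Activate the first commented-out "UsePAM no".
            out.append("UsePAM no")
            has_no = True
        else:
            if re.match(r"^\s*UsePAM\s+no\b", line):
                has_no = True
            out.append(line)
    if not has_no:
        # No "UsePAM no" line was present at all, so append one.
        out.append("UsePAM no")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(disable_pam("#UsePAM no\nUsePAM yes\n"), end="")
```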

Once this is configured, exit the container for now and return to the host OS.

-bash-4.1# exit

Starting the container for SSH

Check the list of containers, including stopped ones, with docker ps -a.
There is one centos container. It has a unique ID (9c4fd33c4786), and its status is Exit.

# docker ps -a
CONTAINER ID        IMAGE                 COMMAND             CREATED              STATUS              PORTS                   NAMES
9c4fd33c4786        centos:6.4            /bin/bash           About a minute ago   Exit 0                                      kickass_pare

Now turn this container, 9c4fd33c4786, into an image with the docker commit command.
Use sshd/centos as the image's repository name.

# docker commit 9c4fd33c4786 sshd/centos
4e3f5f7e23b22900df3d3752ba4644f2282a5d07176b472b03753bf4ed1bd191

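Then start a new container for SSH from the sshd/centos image just created. Unlike before, run it in the background with the -d option, and specify the ssh port number and the foreground process to start. This way the container starts with sshd already running.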

# docker run -d -p 22 sshd/centos /usr/sbin/sshd -D
76451b30b156e0ebb9003152eae0dbe41aa7252d27451fb30a00cdb011ad5e20

SSH connection

The SSH container listens on port 22 inside Docker, but it is exposed on a different port on the host OS. Use the docker port command to find the host port that maps to the container's port 22.

# docker ps -a

CONTAINER ID        IMAGE                 COMMAND             CREATED             STATUS              PORTS                   NAMES
4485f95b6c81        sshd/centos:latest   /usr/sbin/sshd -D   25 minutes ago      Up 25 minutes       0.0.0.0:49158->22/tcp   drunk_curie
9c4fd33c4786        centos:6.4            /bin/bash           About a minute ago   Exit 0                                      kickass_pare
#
#
# docker port 4485f95b6c81 22
0.0.0.0:49158

It appears that port 49158 on the host is forwarded to it.
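Because the host port is assigned dynamically, a script that connects to the container has to read it from the `docker port` output each time. Here is a small Python sketch that parses a line in the `0.0.0.0:49158` form shown above and builds the matching ssh command. The helper names `parse_docker_port` and `ssh_command` are invented for this example, and the sketch does not call docker or ssh itself.

```python
from typing import List, Tuple


def parse_docker_port(output: str) -> Tuple[str, int]:
    """Split a line such as "0.0.0.0:49158" into (host, port)."""
    host, _, port = output.strip().rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"unexpected docker port output: {output!r}")
    return host, int(port)


def ssh_command(docker_port_output: str, host_ip: str, user: str = "root") -> List[str]:
    """Build the ssh argument list for the forwarded port."""
    _, port = parse_docker_port(docker_port_output)
    return ["ssh", f"{user}@{host_ip}", "-p", str(port)]


if __name__ == "__main__":
    # Prints: ssh root@172.17.42.1 -p 49158
    print(" ".join(ssh_command("0.0.0.0:49158\n", "172.17.42.1")))
```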
Also, ifconfig shows the IP address to connect to on the docker0 interface.

# ifconfig
docker0   Link encap:Ethernet  HWaddr FE:9D:BA:61:73:4F
          inet addr:172.17.42.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:20164 errors:0 dropped:0 overruns:0 frame:0
          TX packets:44290 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1155381 (1.1 MiB)  TX bytes:64015506 (61.0 MiB)

This tells us that the container's SSH is reachable at 172.17.42.1 on port 49158.
Let's connect right away.

# ssh root@172.17.42.1 -p 49158
The authenticity of host '[172.17.42.1]:49158 ([172.17.42.1]:49158)' can't be established.
RSA key fingerprint is f2:72:a0:0d:65:58:bf:c9:d1:3f:10:6c:9e:1c:de:9a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[172.17.42.1]:49158' (RSA) to the list of known hosts.
root@172.17.42.1's password:
-bash-4.1#

Connected successfully!
That's all for now.