2013年3月11日月曜日

EMRってなんじゃ?(EMRの詳細なログをみる)

EMRをつかっていて、ステップの途中でエラーが出たときに、デバッグ用のログ出力先を設定すると実行ログがAWSコンソール上で見ることができます。

例として、S3のログバケットをまとめるHiveScriptを動かした時のエラーを見てみます。


エラーが発生しています。
LogURIを指定していると、このようにコンソールからLogFileとして確認することができます。
エラーが起こった3番目のステップのそれぞれのリンクをみてみると以下のようになっています。


controller
2013-03-11T11:31:41.005Z INFO Fetching jar file.
2013-03-11T11:31:48.146Z INFO Working dir /mnt/var/lib/hadoop/steps/3
2013-03-11T11:31:48.146Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java -cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-core.jar:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-core-0.20.205.jar:/home/hadoop/hadoop-tools-0.20.205.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp -Djava.library.path=/home/hadoop/native/Linux-i386-32 org.apache.hadoop.util.RunJar /mnt/var/lib/hadoop/steps/3/script-runner.jar s3://ap-northeast-1.elasticmapreduce/libs/hive/hive-script --run-hive-script --base-path s3://ap-northeast-1.elasticmapreduce/libs/hive/ --hive-versions latest --args -f s3n://memorycraft-archive/script/s3.hql -d INPUT_BUCKET_LOCATION=s3://memorycraft-log/logs/ -d OUTPUT_BUCKET_LOCATION=s3://memorycraft-archive/rslt/ -d YYYY=2013 -d MM=03 -d DD=04
2013-03-11T11:34:20.383Z INFO Execution ended with ret val 255
2013-03-11T11:34:20.384Z WARN Step failed with bad retval
2013-03-11T11:34:26.454Z INFO Step created jobs: job_201303111128_0001


stderr
Logging initialized using configuration in file:/home/hadoop/.versions/hive-0.8.1/conf/hive-log4j.properties
Hive history file=/mnt/var/lib/hive_081/tmp/history/hive_job_log_hadoop_201303111132_1431275973.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/.versions/hive-0.8.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
OK
Time taken: 13.475 seconds
OK
Time taken: 0.765 seconds
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201303111128_0001, Tracking URL = http://ip-10-121-17-157.ap-northeast-1.compute.internal:9100/jobdetails.jsp?jobid=job_201303111128_0001
Kill Command = /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=10.121.17.157:9001 -kill job_201303111128_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-03-11 11:33:11,849 Stage-1 map = 0%,  reduce = 0%
2013-03-11 11:34:12,380 Stage-1 map = 0%,  reduce = 0%
2013-03-11 11:34:19,431 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201303111128_0001 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201303111128_0001_m_000002 (and more) from job job_201303111128_0001
Exception in thread "Thread-47" java.lang.RuntimeException: Error while reading from task log url
 at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
 at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
 at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
 at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://10.122.24.232:9103/tasklog?taskid=attempt_201303111128_0001_m_000000_1&start=-8193
 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
 at java.net.URL.openStream(URL.java:1010)
 at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
 ... 3 more
Counters:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Command exiting with ret '255'


stdout
Downloading 's3://ap-northeast-1.elasticmapreduce/libs/hive/hive-script' to '/mnt/var/lib/hadoop/steps/3/.'
2013-03-11 11:31:52 GMT - INFO Running: /home/hadoop/.versions/hive-0.8.1/bin/hive '-f' 's3n://memorycraft-archive/script/s3.hql' '-d' 'INPUT_BUCKET_LOCATION=s3://memorycraft-log/logs/' '-d' 'OUTPUT_BUCKET_LOCATION=s3://memorycraft-archive/rslt/' '-d' 'YYYY=2013' '-d' 'MM=03' '-d' 'DD=04'
2013-03-11 11:34:20 GMT - ERROR Error executing cmd: /home/hadoop/.versions/hive-0.8.1/bin/hive '-f' 's3n://memorycraft-archive/script/s3.hql' '-d' 'INPUT_BUCKET_LOCATION=s3://memorycraft-log/logs/' '-d' 'OUTPUT_BUCKET_LOCATION=s3://memorycraft-archive/rslt/' '-d' 'YYYY=2013' '-d' 'MM=03' '-d' 'DD=04'


syslog
2013-03-11 11:31:52,225 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening '/libs/hive/hive-script' for reading


一番詳しく出ているのが、stderrのようですが、これだけではよくわかりません。
実際は、AWSコンソール上で確認できるEMRログはこれで全てではなく、詳細なタスクのログがEMRログバケットに出力されています。


追記:
TaskとTask Attemptsに関しては、右側のView Jobsから掘っていくことで、EMRのDebugダイアログでも確認でしましたが、それ以外のログはやはりS3バケットでしかみれなそうです。


EMRログバケットを見てみます。


ジョブフローIDごとにフォルダがあり、その中のstepsというフォルダが、AWSコンソールで表示されるログが入っています。


steps/3 以下に先ほどのログファイルが存在しています。
これ以上くわしい実行タスクのログを見るには、task-attemptsというフォルダに詳細なログが出力されています。



このtask-attempts以下の各サブフォルダにあるsyslogというファイルに詳細なログが出力されています。


task-attempts/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/syslog
2013-03-11 11:33:13,112 INFO org.apache.hadoop.util.NativeCodeLoader (main): Loaded the native-hadoop library
2013-03-11 11:33:13,268 INFO org.apache.hadoop.mapred.TaskRunner (main): Creating symlink: /mnt/var/lib/hadoop/mapred/taskTracker/distcache/-5343292120509370400_-2048022790_1496912286/10.121.17.157/mnt/var/lib/hive_081/tmp/scratch/hive_2013-03-11_11-32-20_837_75945303502865249/-mr-10001/3d096d97-37e9-47c7-a184-8152010d592a <- /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/work/HIVE_PLAN3d096d97-37e9-47c7-a184-8152010d592a
2013-03-11 11:33:13,354 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager (main): Creating symlink: /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/jars/job.jar <- /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/work/job.jar
2013-03-11 11:33:13,362 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager (main): Creating symlink: /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/jars/.job.jar.crc <- /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/work/.job.jar.crc
2013-03-11 11:33:13,370 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager (main): Creating symlink: /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/jars/META-INF <- /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/work/META-INF
2013-03-11 11:33:13,378 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager (main): Creating symlink: /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/jars/org <- /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/work/org
2013-03-11 11:33:13,386 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager (main): Creating symlink: /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/jars/hive-exec-log4j.properties <- /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/work/hive-exec-log4j.properties
2013-03-11 11:33:13,430 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager (main): Creating symlink: /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/jars/javaewah <- /mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201303111128_0001/attempt_201303111128_0001_m_000000_0/work/javaewah
2013-03-11 11:33:13,838 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi already exists!
2013-03-11 11:33:14,037 INFO org.apache.hadoop.mapred.MapTask (main): Host name: ip-10-122-12-88.ap-northeast-1.compute.internal
2013-03-11 11:33:14,088 INFO org.apache.hadoop.util.ProcessTree (main): setsid exited with exit code 0
2013-03-11 11:33:14,106 INFO org.apache.hadoop.mapred.Task (main):  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1dfd868
2013-03-11 11:33:14,342 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2013-03-11 11:33:14,358 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2013-03-11 11:33:14,358 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2013-03-11 11:33:14,367 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2013-03-11 11:33:14,367 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2013-03-11 11:33:14,452 WARN org.apache.hadoop.hive.conf.HiveConf (main): hive-site.xml not found on CLASSPATH
2013-03-11 11:33:16,876 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://memorycraft-log/logs/2013-03-04-09-16-51-45565F6169B3E2F2' for reading
2013-03-11 11:33:17,238 INFO org.apache.hadoop.io.retry.RetryInvocationHandler (main): Exception while invoking retrievePair of class org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore. Not retrying.Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2981A88871482E0B, AWS Error Code: InvalidObjectState, AWS Error Message: The operation is not valid for the object's storage class, S3 Extended Request ID: 7yHCAsfLHZKbf+5ojWMS9q2B41bBwizoqW4zldSdfEpLoLrKvI9X1WI0brHGsC49
 at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:548)
 at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:288)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:170)
 at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2619)
 at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:809)
 at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:266)
 at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:252)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
 at org.apache.hadoop.fs.s3native.$Proxy7.retrievePair(Unknown Source)
 at org.apache.hadoop.fs.s3native.NativeS3FileSystem.open(NativeS3FileSystem.java:1002)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:420)
 at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
 at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
 at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:331)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:292)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:406)
 at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:549)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:199)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:423)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)

2013-03-11 11:33:17,315 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-03-11 11:33:17,452 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-03-11 11:33:17,453 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Got UserName hadoop for UID 106 from the native implementation
2013-03-11 11:33:17,456 WARN org.apache.hadoop.mapred.Child (main): Error running child
java.io.IOException: java.lang.reflect.InvocationTargetException
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:345)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:292)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:406)
 at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:549)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:199)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:423)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:331)
 ... 11 more
Caused by: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2981A88871482E0B, AWS Error Code: InvalidObjectState, AWS Error Message: The operation is not valid for the object's storage class, S3 Extended Request ID: 7yHCAsfLHZKbf+5ojWMS9q2B41bBwizoqW4zldSdfEpLoLrKvI9X1WI0brHGsC49
 at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:548)
 at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:288)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:170)
 at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2619)
 at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:809)
 at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:266)
 at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:252)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
 at org.apache.hadoop.fs.s3native.$Proxy7.retrievePair(Unknown Source)
 at org.apache.hadoop.fs.s3native.NativeS3FileSystem.open(NativeS3FileSystem.java:1002)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:420)
 at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
 at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
 at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
 ... 16 more
2013-03-11 11:33:17,500 INFO org.apache.hadoop.mapred.Task (main): Runnning cleanup for the task


内容をみると、
Error Message: The operation is not valid for the object's storage class, S3 Extended Request ID
となっており、ストレージクラスが良くないと言われています。

EMRの入力ログバケットを見ると、Glaceirになっているクラスがありました。
EMRはGlaceirファイルは処理できないので、エラーになっていたようです。

EMRでワケの分からないエラーが出たらEMRログのS3バケットを見るようにしよう。
以上です。