Reverse and forward DNS are set up correctly, but MapReduce jobs sometimes fail

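Ever since we switched our cluster to communicate over a private interface and built a DNS server with correct forward and reverse lookup zones, we get this message just before an M/R job runs: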

ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormatBase - Cannot resolve the host name for /192.168.3.9 because of javax.naming.NameNotFoundException: DNS name not found [response code 3]; remaining name '9.3.168.192.in-addr.arpa'

Both dig and nslookup show that reverse and forward lookups are answered correctly, with no errors, from inside the cluster.

Shortly after these messages the job starts... but every once in a while we get an NPE instead:

Exception in thread "main" java.lang.NullPointerException
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.net.DNS.reverseDns(DNS.java:93)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.reverseDNS(TableInputFormatBase.java:219)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:184)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
INFO app.insights.search.SearchIndexUpdater - at java.security.AccessController.doPrivileged(Native Method)
INFO app.insights.search.SearchIndexUpdater - at javax.security.auth.Subject.doAs(Subject.java:415)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
INFO app.insights.search.SearchIndexUpdater - at app.insights.search.correlator.comments.CommentCorrelator.main(CommentCorrelator.java:72

Has anyone else set up a CDH Hadoop cluster on a private network with its own DNS server?

We are running CDH 4.3.1 with MR1 2.0.0 and HBase 0.94.6.

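Your internal DNS server may not be responding quickly enough to the number of requests coming from your Hadoop environment (depending on its size).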
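One way to check whether lookups really do fail or stall intermittently is to repeat the same reverse lookup many times and count the failures, since a single dig or nslookup run can easily miss an occasional miss. The following is only a rough sketch: the function names and the attempt count are made up for illustration, and `socket.gethostbyaddr` goes through the operating system's resolver, which may not be the same path the JVM takes for its lookup.

```python
import ipaddress
import socket
import time


def reverse_name(ip):
    """Return the PTR query name for an address, e.g. 9.3.168.192.in-addr.arpa."""
    return ipaddress.ip_address(ip).reverse_pointer


def probe_reverse(ip, attempts=100, lookup=socket.gethostbyaddr):
    """Run `attempts` reverse lookups; return (failures, slowest_seconds).

    `lookup` is injectable so the probe can be exercised without a network.
    """
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        started = time.monotonic()
        try:
            lookup(ip)
        except OSError:
            # socket.herror and socket.gaierror are both OSError subclasses.
            failures += 1
        slowest = max(slowest, time.monotonic() - started)
    return failures, slowest
```

Calling `probe_reverse("192.168.3.9")` from a cluster node while a job is being submitted should show whether any of the lookups fail or take unusually long; `reverse_name("192.168.3.9")` gives the name that appears in the error message.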

You can do one of several things:

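Set up a caching-only name server that handles requests from the Hadoop cluster only. You will need to configure it ahead of any other name server in /etc/resolv.conf on every host.
Enable nscd on every server in the Hadoop cluster so that hostname lookups are cached for a short time.
Edit /etc/hosts on every server in the Hadoop cluster to include the full list of IP/hostname pairs for every server in the cluster.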
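For the first option, the ordering in /etc/resolv.conf matters because the resolver tries name servers in the order they are listed. Here is a small sketch of how the file could be rewritten so the caching server comes first. The function name and every address in the usage note are hypothetical, and the function only transforms text; writing the result back to each host is left out.

```python
def put_nameserver_first(resolv_conf_text, cache_ip):
    """Return resolv.conf text with `nameserver cache_ip` ahead of all others."""
    wanted = "nameserver " + cache_ip
    out = []
    inserted = False
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        is_nameserver = len(fields) >= 2 and fields[0] == "nameserver"
        if is_nameserver and fields[1] == cache_ip:
            continue  # drop any existing entry; it is re-added at the front
        if is_nameserver and not inserted:
            out.append(wanted)
            inserted = True
        out.append(line)
    if not inserted:
        out.append(wanted)  # the file had no other nameserver lines
    return "\n".join(out) + "\n"
```

Given a file that lists `nameserver 192.168.3.3`, calling `put_nameserver_first(text, "192.168.3.2")` puts the (made-up) caching server 192.168.3.2 on the line just before it and leaves `search` and `options` lines where they were. Running it a second time gives the same result.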

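Setting up a caching-only name server is fairly straightforward. You should be able to find a suitable tutorial for your operating system with a little searching.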

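Setting up nscd is also fairly trivial, though it can sometimes cause surprising behavior (for example, hostname changes taking longer than expected to show up). With a short enough cache time, this has not been a problem for us. I recommend disabling the passwd and group caches that nscd can enable. The cache time does not need to be long. 600 seconds seems to be a good balance for our cluster and significantly reduces the number of actual DNS queries. Even 60 seconds would be better than hitting the DNS server repeatedly.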

My configuration file looks like this:

    logfile /var/log/nscd.log
    threads 6
    max-threads 128
    server-user nscd
    # stat-user nocpulse
    debug-level 0
    # reload-count 5
    paranoia no
    # restart-interval 3600

    enable-cache passwd no
    positive-time-to-live passwd 600
    negative-time-to-live passwd 20
    suggested-size passwd 211
    check-files passwd yes
    persistent passwd yes
    shared passwd yes
    max-db-size passwd 33554432
    auto-propagate passwd yes

    enable-cache group no
    positive-time-to-live group 3600
    negative-time-to-live group 60
    suggested-size group 211
    check-files group yes
    persistent group yes
    shared group yes
    max-db-size group 33554432
    auto-propagate group yes

    enable-cache hosts yes
    positive-time-to-live hosts 600
    negative-time-to-live hosts 20
    suggested-size hosts 211
    check-files hosts yes
    persistent hosts yes
    shared hosts yes
    max-db-size hosts 33554432
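To make the positive and negative time-to-live settings for hosts a bit more concrete, here is a toy cache that keeps successful lookups for 600 seconds and failed ones for 20 seconds. It is only an illustration of the idea, not how nscd is implemented; the class name, the injectable `lookup` and `clock` parameters, and the use of `None` for a cached failure are all assumptions of the sketch.

```python
import time


class TtlHostCache:
    """Cache lookup results, keeping hits and misses for different lifetimes."""

    def __init__(self, lookup, positive_ttl=600, negative_ttl=20,
                 clock=time.monotonic):
        self._lookup = lookup
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value or None)

    def resolve(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no query is sent
        try:
            value = self._lookup(key)
            ttl = self._positive_ttl
        except OSError:
            value = None  # remember the failure, but only briefly
            ttl = self._negative_ttl
        self._entries[key] = (now + ttl, value)
        return value
```

With these numbers, a host that is looked up thousands of times during job submission reaches the DNS server at most once every ten minutes, while a failed lookup is retried after 20 seconds so a temporary miss does not stick around for long.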

Finally, the /etc/hosts route: if you have a large cluster, I would not recommend this. Keeping the file up to date on every host is too costly.