High iowait on an Amazon EC2 MySQL instance with an EBS volume

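We have a MySQL server running on an Amazon EC2 c1.medium instance, backed by a single EBS volume with an ext3 filesystem for storage.

This MySQL server is queried at roughly 500 queries per second by several applications running on a few web servers, which are also on Amazon EC2.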

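As you can see below, the server's load average and CPU idle time look fine, but one thing worries me: the high iowait it is experiencing.

Another number that worries me is the transfers per second (tps) reported by iostat, which is above 450 most of the time. After some research on the topic, I saw some people say this is too much for a single EBS volume: https://forums.aws.amazon.com/thread.jspa?threadID=30769

By the way, the command output below was not captured at peak time. This is how the server behaves most of the time.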

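So, with all that said, here are my questions:

1- Is it time to consider moving to a RAID architecture (I would say RAID 0)?

2- Should I spend time on a clustering solution, such as MySQL Cluster?

3- Do you think this situation is seriously affecting our applications? Would they perform better if we migrated to RAID 0 and/or a clustering solution? (The applications seem happy so far, but could they be happier?)

Please let me know if you need any further information.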

~ # uptime
 12:34:14 up 2 days, 4:06, 1 user, load average: 2.24, 1.90, **1.84**

########################################################

~ # vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b swpd  free  buff   cache si so    bi  bo   in   cs us sy id **wa** st
 0  1   52 11168 16420 1498728  0  0  4586 231   11   81  6  3 52     39  0
 2  1   52 10460 16320 1499588  0  0 11631 397 3194 4319 10  4 47     39  0
 4  1   52 11448 16064 1499156  0  0 12231 592 2301 3331  9  5 50     36  0
 4  0   52 10328 16068 1500176  0  0  8578 392 2131 2745  8  6 49     37  0
 0  1   52 11164 15732 1499928  0  0  9604 578 2609 3510  7  4 49     40  0
 0  1   52 10824 15768 1499836  0  0  5038 634 1912 2509  8  3 47     42  0
 3  1   52 12040 15888 1498096  0  0  5068 204 1927 2531 10  8 45     37  0
 8  2   52 11252 15784 1499272  0  0  8521 390 2437 3100 14 15 39     31  0
 1  2   52 11436 15724 1499748  0  0  8287 401 2159 3113 11 10 42     36  1
 0  1   52 12016 15704 1498752  0  0 11576 499 3324 3984 16 17 31     36  0
 1  1   52 10536 15664 1500508  0  0  8430 718 2686 3265 15 14 37     34  0
 1  1   52 10300 15676 1500744  0  0 10186 720 2488 3488 16  5 45     34  0

########################################################

~ # iostat -dm 5 /dev/sdf
Linux 2.6.21.7-2.fc8xen (database-new)  01/20/12

Device:     tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdf      464.81       8.84       0.33  1658860    61390

Device:     tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdf      402.20       7.39       0.43       36        2

Device:     tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdf      431.40       7.74       0.32       38        1

Device:     tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdf      461.40       8.26       0.39       41        1

Device:     tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdf      475.65       9.20       0.29       46        1

Device:     tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdf      534.80       9.82       0.52       49        2

Device:     tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdf      526.60       9.97       0.52       49        2

########################################################

~ # iostat -mdx 5 /dev/sdf
Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       22.21   46.28  427.47  37.54   8.84   0.33     40.38      1.78   3.82   1.72  79.87

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       22.36   80.04  450.30  60.48   9.29   0.55     39.44      1.45   2.85   1.58  80.48

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       23.40   43.60  370.60  47.00   7.75   0.35     39.76      1.45   3.47   1.97  82.08

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       20.20   33.20  382.60  29.60   8.02   0.25     41.05      1.31   3.17   2.11  87.12

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       28.80   35.20  422.40  33.40   9.04   0.27     41.80      1.45   3.19   1.95  88.96

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       14.20   45.00  291.80  51.40   5.97   0.38     37.86      1.45   4.22   2.50  85.68

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       19.16   56.89  535.33  41.32  11.44   0.38     42.00      1.49   2.59   1.53  88.46

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdf       20.40   81.40  233.00  64.40   4.86   0.57     37.39      1.74   5.84   3.18  94.72
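To put a single number on the iowait visible in the vmstat output, here is a minimal sketch that averages the wa column over the twelve rows posted above. The helper name `mean_iowait` is made up for illustration, and the code assumes the 17-column vmstat layout shown here, with wa as the second-to-last field.

```python
# The twelve vmstat rows from the output above, copied verbatim.
VMSTAT_ROWS = """\
0 1 52 11168 16420 1498728 0 0 4586 231 11 81 6 3 52 39 0
2 1 52 10460 16320 1499588 0 0 11631 397 3194 4319 10 4 47 39 0
4 1 52 11448 16064 1499156 0 0 12231 592 2301 3331 9 5 50 36 0
4 0 52 10328 16068 1500176 0 0 8578 392 2131 2745 8 6 49 37 0
0 1 52 11164 15732 1499928 0 0 9604 578 2609 3510 7 4 49 40 0
0 1 52 10824 15768 1499836 0 0 5038 634 1912 2509 8 3 47 42 0
3 1 52 12040 15888 1498096 0 0 5068 204 1927 2531 10 8 45 37 0
8 2 52 11252 15784 1499272 0 0 8521 390 2437 3100 14 15 39 31 0
1 2 52 11436 15724 1499748 0 0 8287 401 2159 3113 11 10 42 36 1
0 1 52 12016 15704 1498752 0 0 11576 499 3324 3984 16 17 31 36 0
1 1 52 10536 15664 1500508 0 0 8430 718 2686 3265 15 14 37 34 0
1 1 52 10300 15676 1500744 0 0 10186 720 2488 3488 16 5 45 34 0
"""


def mean_iowait(rows: str) -> float:
    """Average the 'wa' column (second-to-last field) of vmstat data rows."""
    values = [int(line.split()[-2]) for line in rows.splitlines() if line.strip()]
    return sum(values) / len(values)


print(f"mean iowait over {len(VMSTAT_ROWS.splitlines())} samples: "
      f"{mean_iowait(VMSTAT_ROWS):.2f}%")
```

On these rows the mean comes out at 36.75%. Note that the first row vmstat prints is an average since boot rather than a 5-second sample, so it could reasonably be dropped before averaging.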

################################################## my.cnf

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
long_query_time=1
key_buffer = 64M
thread_cache_size = 30
table_cache = 1024
table_definition_cache = 512
query_cache_type = 1
query_cache_size = 64M
tmp_table_size = 64M
max_heap_table_size = 64M
innodb_buffer_pool_size = 512M
old_passwords=1
max_connections=400
wait_timeout=30

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

[ndbd]
connect-string="nodeid=2;host=localhost:1186"

[ndb_mgm]
connect-string="host=localhost:1186"

################################################## Tuning script output

~ # ./tuning-primer.sh

-- MYSQL PERFORMANCE TUNING PRIMER --
- By: Matthew Montgomery -

MySQL Version 5.1.52 i686

Uptime = 0 days 1 hrs 1 min 1 sec
Avg. qps = 517
Total Questions = 1894942
Threads Connected = 94

Warning: Server has not been running for at least 48hrs.
It may not be safe to use these recommendations

To find out more information on how each of these
runtime variables effects performance visit:
http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html
Visit http://www.mysql.com/products/enterprise/advisors.html
for info about MySQL's Enterprise Monitoring and Advisory Service

SLOW QUERIES
The slow query log is NOT enabled.
Current long_query_time = 1.000000 sec.
You have 207 out of 1894981 that take longer than 1.000000 sec. to complete
Your long_query_time seems to be fine

BINARY UPDATE LOG
The binary update log is NOT enabled.
You will not be able to do point in time recovery
See http://dev.mysql.com/doc/refman/5.1/en/point-in-time-recovery.html

WORKER THREADS
Current thread_cache_size = 30
Current threads_cached = 8
Current threads_per_sec = 0
Historic threads_per_sec = 0
Your thread_cache_size is fine

MAX CONNECTIONS
Current max_connections = 400
Current threads_connected = 93
Historic max_used_connections = 195
The number of used connections is 48% of the configured maximum.
Your max_connections variable seems to be fine.

INNODB STATUS
Current InnoDB index space = 1.33 G
Current InnoDB data space = 5.04 G
Current InnoDB buffer pool free = 0 %
Current innodb_buffer_pool_size = 512 M
Depending on how much space your innodb indexes take up it may be safe
to increase this value to up to 2 / 3 of total system memory

MEMORY USAGE
Max Memory Ever Allocated : 1.13 G
Configured Max Per-thread Buffers : 1.04 G
Configured Max Global Buffers : 642 M
Configured Max Memory Limit : 1.67 G
Physical Memory : 1.70 G
Max memory limit exceeds 90% of physical memory

KEY BUFFER
Current MyISAM index space = 379 M
Current key_buffer_size = 64 M
Key cache miss rate is 1 : 162
Key buffer free ratio = 80 %
Your key_buffer_size seems to be fine

QUERY CACHE
Query cache is enabled
Current query_cache_size = 64 M
Current query_cache_used = 43 M
Current query_cache_limit = 1 M
Current Query cache Memory fill ratio = 67.44 %
Current query_cache_min_res_unit = 4 K
MySQL won't cache query results that are larger than query_cache_limit in size

SORT OPERATIONS
Current sort_buffer_size = 2 M
Current read_rnd_buffer_size = 256 K
Sort buffer seems to be fine

JOINS
Current join_buffer_size = 132.00 K
You have had 4013 queries where a join could not use an index properly
You should enable "log-queries-not-using-indexes"
Then look for non indexed joins in the slow query log.
If you are unable to optimize your queries you may want to increase your
join_buffer_size to accommodate larger joins in one pass.

Note! This script will still suggest raising the join_buffer_size when
ANY joins not using indexes are found.

OPEN FILES LIMIT
Current open_files_limit = 2458 files
The open_files_limit should typically be set to at least 2x-3x
that of table_cache if you have heavy MyISAM usage.
Your open_files_limit value seems to be fine

TABLE CACHE
Current table_open_cache = 1024 tables
Current table_definition_cache = 512 tables
You have a total of 45237 tables
You have 1024 open tables.
Current table_cache hit rate is 0%
, while 100% of your table cache is in use
You should probably increase your table_cache
You should probably increase your table_definition_cache value.

TEMP TABLES
Current max_heap_table_size = 64 M
Current tmp_table_size = 64 M
Of 38723 temp tables, 44% were created on disk
Perhaps you should increase your tmp_table_size and/or max_heap_table_size
to reduce the number of disk-based temporary tables
Note! BLOB and TEXT columns are not allow in memory tables.
If you are using these columns raising these values might not impact your
ratio of on disk temp tables.

TABLE SCANS
Current read_buffer_size = 128 K
Current table scan ratio = 537 : 1
read_buffer_size seems to be fine

TABLE LOCKING
Current Lock Wait ratio = 1 : 954
You may benefit from selective use of InnoDB.
If you have long running SELECT's against MyISAM tables and perform
frequent updates consider setting 'low_priority_updates=1'
If you have a high concurrency of inserts on Dynamic row-length tables
consider setting 'concurrent_insert=2'.

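It would help if you posted your my.cnf, whether you are using InnoDB or MyISAM tables, and whether the workload is read-heavy or write-heavy. Otherwise, we are just guessing. Here is my guess:

First, I would check that your queries are properly indexed. High I/O on a MySQL database is caused by very high concurrency, a poorly tuned server, or poorly performing queries that have to do full table or index scans. Some tips on how to find poorly performing queries can be found in my post on the Ideeli tech blog.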
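As one way to hunt for poorly performing queries, the sketch below ranks entries in a slow query log by query time and shows how many rows each one examined. It is only an illustration: the function name `worst_queries` and the sample log are hypothetical, and the header pattern assumes the stock `# Query_time: ... Lock_time: ... Rows_sent: ... Rows_examined: ...` line, which other MySQL versions and forks may extend.

```python
import re

# Per-query header line of the slow query log (assumed stock format).
HEADER = re.compile(
    r"^# Query_time: (?P<time>[\d.]+)\s+Lock_time: [\d.]+\s+"
    r"Rows_sent: \d+\s+Rows_examined: (?P<examined>\d+)"
)


def worst_queries(log_text: str, top: int = 3):
    """Return (query_time, rows_examined, statement) tuples, slowest first."""
    entries = []
    current = None
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = [float(match["time"]), int(match["examined"]), ""]
            entries.append(current)
        elif current is not None and not line.startswith("#"):
            # Skip bookkeeping statements that precede the real query.
            if line.startswith("SET timestamp") or line.lower().startswith("use "):
                continue
            current[2] = (current[2] + " " + line.strip()).strip()
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [tuple(entry) for entry in entries[:top]]


# Hypothetical log excerpt, not taken from the server in the question.
SAMPLE_LOG = """\
# Time: 120120 12:34:14
# User@Host: app[app] @ web1 [10.0.0.5]
# Query_time: 1.050000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 300
SET timestamp=1327062854;
SELECT id FROM users WHERE id = 7;
# User@Host: app[app] @ web2 [10.0.0.6]
# Query_time: 4.210000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 2500000
SET timestamp=1327062860;
SELECT * FROM orders WHERE customer_email = 'a@example.com';
"""

for seconds, examined, statement in worst_queries(SAMPLE_LOG):
    print(f"{seconds:8.2f}s  {examined:>9} rows examined  {statement}")
```

A query that examines millions of rows to return a handful is the usual sign of a missing index, and those are the ones worth running through EXPLAIN first.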

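Check your my.cnf. If you are using InnoDB, make sure innodb_buffer_pool_size and innodb_log_file_size are large enough. Because EBS has such variable latency, maximizing innodb_log_file_size can bring significant performance benefits. If you are using MyISAM (you shouldn't be), make sure your key_buffer size is large enough.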
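To judge whether innodb_buffer_pool_size is "large enough", a rough first check is how much of the InnoDB data and index space the pool could hold at once. The sketch below uses the figures from the tuning-primer output in the question (512 M pool, 5.04 G data, 1.33 G indexes); the function name is made up, and the ratio is only a coarse indicator, since the hot working set may be much smaller than the full dataset.

```python
def buffer_pool_coverage(pool_mb: float, data_gb: float, index_gb: float) -> float:
    """Fraction of InnoDB data plus indexes that fits in the buffer pool (capped at 1)."""
    total_mb = (data_gb + index_gb) * 1024
    return min(1.0, pool_mb / total_mb)


# Figures reported by tuning-primer.sh in the question.
coverage = buffer_pool_coverage(pool_mb=512, data_gb=5.04, index_gb=1.33)
print(f"the buffer pool can hold about {coverage:.1%} of InnoDB data + indexes")
```

With those numbers the pool covers roughly 8% of the InnoDB footprint, which is consistent with the read-heavy disk traffic in the iostat output.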

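If you are confident that your queries are fully optimized and your server is well tuned, we can move on to the next item. ext3 is not ideal for databases. One of the main reasons is that ext3 allows only one thread at a time to update an inode (I am still trying to find the documentation for this). If you are not running innodb-file-per-table, this means a lot of filesystem contention on the ibdata file. xfs does not have this limitation and has been shown to perform much better for database workloads (citation needed).

If you cannot change to xfs, make sure you are using innodb-file-per-table, and at the very least make sure the filesystem is mounted with noatime,nodiratime.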
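A quick way to verify the mount options is to look at the entry for the data directory in `/proc/mounts`. The sketch below parses text in that format and reports which of the wanted options are absent. The function name, the sample lines, and the assumption that `/var/lib/mysql` (the datadir from the my.cnf in the question) is itself the mount point are all illustrative.

```python
def missing_mount_options(mounts_text, mount_point, wanted=("noatime", "nodiratime")):
    """Return the wanted options missing from mount_point's entry.

    mounts_text uses the /proc/mounts layout:
    device, mount point, filesystem type, options, dump, pass.
    Returns None when mount_point has no entry at all.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
            return [opt for opt in wanted if opt not in options]
    return None


# Hypothetical entries for illustration.
BEFORE = "/dev/sdf /var/lib/mysql ext3 rw,data=ordered 0 0"
AFTER = "/dev/sdf /var/lib/mysql ext3 rw,noatime,nodiratime,data=ordered 0 0"

print(missing_mount_options(BEFORE, "/var/lib/mysql"))
print(missing_mount_options(AFTER, "/var/lib/mysql"))
```

On a live host you would pass the contents of `/proc/mounts` instead of the sample strings.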

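Next, your instance size. c1.medium is not an ideal instance size for most databases unless the dataset is small. MySQL generally benefits more from memory than from compute power. c1.medium has only 1.7GB of RAM! How big is your dataset? In general, an m1.large (with 7.5GB of RAM) will outperform a c1.medium in all but rare cases. It is also twice the price, at $0.34 per hour.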
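The trade-off can be made concrete with a little arithmetic. The sketch below compares the two instance types using only the figures in this answer; the c1.medium price is inferred from "twice the price" rather than quoted, and the 730-hour month is an assumption for illustration.

```python
HOURS_PER_MONTH = 730  # assumed average month length, for illustration only

# name: (RAM in GB, on-demand price in USD per hour)
INSTANCES = {
    "c1.medium": (1.7, 0.34 / 2),  # price inferred from "twice the price"
    "m1.large": (7.5, 0.34),
}


def ram_per_hourly_dollar(ram_gb: float, usd_per_hour: float) -> float:
    """GB of RAM obtained per USD-per-hour spent."""
    return ram_gb / usd_per_hour


for name, (ram_gb, usd_per_hour) in INSTANCES.items():
    print(f"{name}: {ram_gb} GB RAM, "
          f"${usd_per_hour * HOURS_PER_MONTH:.2f}/month, "
          f"{ram_per_hourly_dollar(ram_gb, usd_per_hour):.1f} GB per $/hour")
```

Under these assumptions the m1.large costs twice as much but provides about 4.4 times the memory, so each dollar buys more than twice as much RAM.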

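Now on to RAID of EBS volumes. Yes, RAID will greatly increase your IOPS (as will increasing your instance size). Do not use RAID0... if you care about your data, at least. I have explained this in many places, including on my blog, in a talk at Percona Live NYC 2011, and on Server Fault. The short version is that EBS volumes fail in atypical ways, and being able to remove a volume from the set has proven valuable on multiple occasions, most notably during the 2011 EBS outage, when some sites were offline for days. We were offline for 45 minutes at 4am, despite having dozens of instances affected by the EBS problem.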

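Here are some benchmarks of RAIDed EBS volumes with MySQL.

Finally, Percona Server has a large number of scalability optimizations. Here is a white paper about my company's experience switching from MySQL to Percona Server. We were experiencing database stalls and outages daily. Simply switching from MySQL to Percona Server resolved the issue, thanks to some of its scalability improvements.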

So, to summarize...

Tune your queries
Tune your server
Get yourself better "hardware"
Use xfs, not ext3
RAID10, not RAID0
Switch from MySQL to Percona Server

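As for MySQL Cluster, it is a completely different animal from MySQL and is generally not suitable for most OLTP applications. Galera / Percona XtraDB Cluster are also interesting new clustering products. However, you have a lot of options before you need any of this. We serve 24k QPS with RAID10 in EC2 on a single m2.4xlarge.

Good luck!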

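This is a problem many companies run into, and its solutions are discussed quite thoroughly on various online forums.

Typically, to increase the potential IOPS, two or more EBS volumes are joined together in a RAID0 array. This is not without risk, however. As you may know, with RAID0 it takes a problem with only one of the member EBS volumes for your data to be toast. You might therefore consider a more resilient RAID level.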
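The difference in risk can be sketched with a simple probability model. The code below compares the chance of losing the whole array for RAID0 and RAID10, given a per-volume failure probability `p` over some period. Both the value of `p` and the assumption that volumes fail independently are illustrative; correlated failures during a provider-wide incident would not follow this model.

```python
def raid0_loss_probability(p: float, volumes: int) -> float:
    """RAID0 loses data as soon as any one member volume fails."""
    return 1 - (1 - p) ** volumes


def raid10_loss_probability(p: float, volumes: int) -> float:
    """RAID10 loses data only when both volumes of some mirror pair fail."""
    if volumes < 2 or volumes % 2:
        raise ValueError("RAID10 needs an even number of volumes (at least 2)")
    pairs = volumes // 2
    return 1 - (1 - p ** 2) ** pairs


# Hypothetical per-volume failure probability, chosen only for illustration.
P_FAIL = 0.01
for n in (2, 4, 8):
    print(f"{n} volumes: RAID0 {raid0_loss_probability(P_FAIL, n):.4%}, "
          f"RAID10 {raid10_loss_probability(P_FAIL, n):.4%}")
```

Adding volumes to a RAID0 array raises the chance of total loss almost linearly, while RAID10 keeps it orders of magnitude lower at the cost of half the usable capacity.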

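3- Do you think this situation is seriously affecting our applications? Would they perform better if we migrated to RAID 0 and/or a clustering solution?

Since you are running a SQL server, it would make more sense to look at the SQL server's own metrics to see whether queries are being served quickly. Looking at your single-digit average request wait times (await), I do not think I/O is a big problem yet.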
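For reference, here is a small sketch that averages the await and %util columns over the eight `iostat -mdx` samples posted in the question. The helper name `column_mean` is made up, and the column order is taken from the header printed in that output.

```python
# Column order of the `iostat -mdx` output in the question (after the device name).
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rMB/s", "wMB/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]

# The eight sdf rows from the question, copied verbatim.
IOSTAT_ROWS = """\
sdf 22.21 46.28 427.47 37.54 8.84 0.33 40.38 1.78 3.82 1.72 79.87
sdf 22.36 80.04 450.30 60.48 9.29 0.55 39.44 1.45 2.85 1.58 80.48
sdf 23.40 43.60 370.60 47.00 7.75 0.35 39.76 1.45 3.47 1.97 82.08
sdf 20.20 33.20 382.60 29.60 8.02 0.25 41.05 1.31 3.17 2.11 87.12
sdf 28.80 35.20 422.40 33.40 9.04 0.27 41.80 1.45 3.19 1.95 88.96
sdf 14.20 45.00 291.80 51.40 5.97 0.38 37.86 1.45 4.22 2.50 85.68
sdf 19.16 56.89 535.33 41.32 11.44 0.38 42.00 1.49 2.59 1.53 88.46
sdf 20.40 81.40 233.00 64.40 4.86 0.57 37.39 1.74 5.84 3.18 94.72
"""


def column_mean(rows: str, column: str) -> float:
    """Average one named iostat column across all device rows."""
    index = COLUMNS.index(column) + 1  # +1 skips the device name
    values = [float(line.split()[index]) for line in rows.splitlines() if line.strip()]
    return sum(values) / len(values)


print(f"mean await: {column_mean(IOSTAT_ROWS, 'await'):.2f} ms")
print(f"mean %util: {column_mean(IOSTAT_ROWS, '%util'):.2f}")
```

On these samples the mean await is about 3.6 ms while the mean %util is about 86, so individual requests complete quickly even though the device is busy most of the time.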

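Also, since what you are seeing is mainly read load, you could reduce it with larger caches: increase the amount of RAM and tune the caching parameters of your MySQL instance. I would expect this to have a much larger performance impact than changing the storage to handle more I/O.

Since 500 qps is a rather moderate load for a SQL server, I suggest looking at the percentage of temporary tables created on disk, and starting to optimize your queries and MySQL server settings.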
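The share of temporary tables that spill to disk can be derived from two MySQL status counters, `Created_tmp_tables` and `Created_tmp_disk_tables`. The sketch below computes it from a dictionary of `SHOW GLOBAL STATUS` values; the function name and the sample numbers are hypothetical, and a tuning script may compute its own percentage differently.

```python
def disk_tmp_table_percent(status: dict) -> float:
    """Percentage of internal temporary tables that were created on disk."""
    on_disk = int(status["Created_tmp_disk_tables"])
    total = int(status["Created_tmp_tables"])
    return 100.0 * on_disk / total if total else 0.0


# Hypothetical counter values, as strings the way a client library returns them.
sample_status = {"Created_tmp_tables": "1000", "Created_tmp_disk_tables": "250"}
print(f"{disk_tmp_table_percent(sample_status):.1f}% of temp tables went to disk")
```

Sampling the counters twice and working with the difference gives the rate for a specific window rather than the average since server start.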

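1. Do not go the RAID0 route; it will eventually fail and you will regret it.

2. No, at so few queries per second you do not need MySQL Cluster.

3. Yes, it definitely affects application performance. To measure it, you can enable the slow query log and see for yourself.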

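How much memory is MySQL currently using, and is there any left?
If not, you should consider switching to a larger instance, and start optimizing the settings with the tuning-primer script: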
http://www.day32.com/MySQL/tuning-primer.sh
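The memory question can be checked against the tuning-primer figures already posted in the question. The sketch below adds the configured global buffers to the per-thread buffers at max_connections and compares the result with physical memory; the function name is made up, and the breakdown (global plus per-thread total) is an assumption about how the reported limit is composed, which the numbers happen to match.

```python
def configured_memory_limit_gb(global_buffers_mb: float, per_thread_total_gb: float) -> float:
    """Worst-case MySQL memory: global buffers plus all per-thread buffers."""
    return global_buffers_mb / 1024 + per_thread_total_gb


# Figures from the MEMORY USAGE section of the tuning-primer output.
GLOBAL_BUFFERS_MB = 642
PER_THREAD_TOTAL_GB = 1.04   # for max_connections = 400
PHYSICAL_GB = 1.70
MAX_CONNECTIONS = 400

limit_gb = configured_memory_limit_gb(GLOBAL_BUFFERS_MB, PER_THREAD_TOTAL_GB)
per_connection_mb = PER_THREAD_TOTAL_GB * 1024 / MAX_CONNECTIONS

print(f"configured limit: {limit_gb:.2f} G "
      f"({limit_gb / PHYSICAL_GB:.0%} of physical memory)")
print(f"per-connection buffers: about {per_connection_mb:.2f} MB")
```

This reproduces the 1.67 G limit from the report, about 98% of the 1.70 G of physical memory, which leaves essentially no room to raise the buffer pool on the current instance without lowering something else.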