One ColumnFamily places data on only three out of four nodes

I have already posted this on the cassandra-user mailing list but haven't received any replies, so I'm wondering whether anyone here on serverfault.com might have some ideas.

I seem to be running into a rather strange (at least to me) problem/behaviour with Cassandra.

I'm running a 4-node cluster on Cassandra 0.8.7. For the keyspace in question I have RF = 3 and SimpleStrategy, with multiple ColumnFamilies in the keyspace. However, one of the ColumnFamilies seems to have its data distributed over only 3 of the 4 nodes.

Aside from the ColumnFamily in question, the data on the cluster seems to be more or less the same on every node.

    # nodetool -h localhost ring
    Address         DC          Rack   Status State   Load     Owns    Token
                                                                       127605887595351923798765477786913079296
    192.168.81.2    datacenter1 rack1  Up     Normal  7.27 GB  25.00%  0
    192.168.81.3    datacenter1 rack1  Up     Normal  7.74 GB  25.00%  42535295865117307932921825928971026432
    192.168.81.4    datacenter1 rack1  Up     Normal  7.38 GB  25.00%  85070591730234615865843651857942052864
    192.168.81.5    datacenter1 rack1  Up     Normal  7.32 GB  25.00%  127605887595351923798765477786913079296
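To make my expectation concrete: with SimpleStrategy and RF = 3, every key should be stored on the node owning its token range plus the next two nodes on the ring, so with only four nodes each node ought to hold roughly 75% of the keys. Here is a rough Python simulation of that placement using the ring tokens above (the MD5 hashing only approximates the default RandomPartitioner, and the keys are made up):

    from hashlib import md5
    from collections import Counter

    # Ring tokens from `nodetool ring` above, in ring order.
    ring = [
        (0, '192.168.81.2'),
        (42535295865117307932921825928971026432, '192.168.81.3'),
        (85070591730234615865843651857942052864, '192.168.81.4'),
        (127605887595351923798765477786913079296, '192.168.81.5'),
    ]

    def token(key):
        # RandomPartitioner tokens live in [0, 2**127); MD5 is a close stand-in.
        return int.from_bytes(md5(key.encode()).digest(), 'big') % 2**127

    def replicas(key, rf=3):
        # SimpleStrategy: the first node whose token is >= the key's token
        # (wrapping around), plus the next rf-1 nodes clockwise on the ring.
        t = token(key)
        i = next((n for n, (tok, _) in enumerate(ring) if tok >= t), 0)
        return [ring[(i + j) % len(ring)][1] for j in range(rf)]

    counts = Counter()
    for k in range(10000):
        counts.update(replicas('user%d' % k))
    print(counts)  # expected: each of the four nodes appears ~7500 times (75%)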

The relevant parts of the keyspace schema look like this:

    [default@A] show schema;
    create keyspace A
      with placement_strategy = 'SimpleStrategy'
      and strategy_options = [{replication_factor : 3}];
    [...]
    create column family UserDetails
      with column_type = 'Standard'
      and comparator = 'IntegerType'
      and default_validation_class = 'BytesType'
      and key_validation_class = 'BytesType'
      and memtable_operations = 0.571875
      and memtable_throughput = 122
      and memtable_flush_after = 1440
      and rows_cached = 0.0
      and row_cache_save_period = 0
      and keys_cached = 200000.0
      and key_cache_save_period = 14400
      and read_repair_chance = 1.0
      and gc_grace = 864000
      and min_compaction_threshold = 4
      and max_compaction_threshold = 32
      and replicate_on_write = true
      and row_cache_provider = 'ConcurrentLinkedHashCacheProvider';

Now the symptoms – the output of 'nodetool -h localhost cfstats' on each node. Note the numbers on node1.

node1

    Column Family: UserDetails
    SSTable count: 0
    Space used (live): 0
    Space used (total): 0
    Number of Keys (estimate): 0
    Memtable Columns Count: 0
    Memtable Data Size: 0
    Memtable Switch Count: 0
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 0
    Write Latency: NaN ms.
    Pending Tasks: 0
    Key cache capacity: 200000
    Key cache size: 0
    Key cache hit rate: NaN
    Row cache: disabled
    Compacted row minimum size: 0
    Compacted row maximum size: 0
    Compacted row mean size: 0

node2

    Column Family: UserDetails
    SSTable count: 3
    Space used (live): 112952788
    Space used (total): 164953743
    Number of Keys (estimate): 384
    Memtable Columns Count: 159419
    Memtable Data Size: 74910890
    Memtable Switch Count: 59
    Read Count: 135307426
    Read Latency: 25.900 ms.
    Write Count: 3474673
    Write Latency: 0.040 ms.
    Pending Tasks: 0
    Key cache capacity: 200000
    Key cache size: 120
    Key cache hit rate: 0.999971684189041
    Row cache: disabled
    Compacted row minimum size: 42511
    Compacted row maximum size: 74975550
    Compacted row mean size: 42364305

node3

    Column Family: UserDetails
    SSTable count: 3
    Space used (live): 112953137
    Space used (total): 112953137
    Number of Keys (estimate): 384
    Memtable Columns Count: 159421
    Memtable Data Size: 74693445
    Memtable Switch Count: 56
    Read Count: 135304486
    Read Latency: 25.552 ms.
    Write Count: 3474616
    Write Latency: 0.036 ms.
    Pending Tasks: 0
    Key cache capacity: 200000
    Key cache size: 109
    Key cache hit rate: 0.9999716840888175
    Row cache: disabled
    Compacted row minimum size: 42511
    Compacted row maximum size: 74975550
    Compacted row mean size: 42364305

node4

    Column Family: UserDetails
    SSTable count: 3
    Space used (live): 117070926
    Space used (total): 119479484
    Number of Keys (estimate): 384
    Memtable Columns Count: 159979
    Memtable Data Size: 75029672
    Memtable Switch Count: 60
    Read Count: 135294878
    Read Latency: 19.455 ms.
    Write Count: 3474982
    Write Latency: 0.028 ms.
    Pending Tasks: 0
    Key cache capacity: 200000
    Key cache size: 119
    Key cache hit rate: 0.9999752235777154
    Row cache: disabled
    Compacted row minimum size: 2346800
    Compacted row maximum size: 62479625
    Compacted row mean size: 42591803

When I go into the 'data' directory on node1, there are no files at all for the UserDetails ColumnFamily.

I tried running a manual repair, hoping it would cure the situation, but no luck.

    # nodetool -h localhost repair A UserDetails
    INFO 15:19:54,611 Starting repair command #8, repairing 3 ranges.
    INFO 15:19:54,647 Sending AEService tree for #<TreeRequest manual-repair-89c1acb0-184c-438f-bab8-7ceed27980ec, /192.168.81.2, (A,UserDetails), (85070591730234615865843651857942052864,127605887595351923798765477786913079296]>
    INFO 15:19:54,742 Endpoints /192.168.81.2 and /192.168.81.3 are consistent for UserDetails on (85070591730234615865843651857942052864,127605887595351923798765477786913079296]
    INFO 15:19:54,750 Endpoints /192.168.81.2 and /192.168.81.5 are consistent for UserDetails on (85070591730234615865843651857942052864,127605887595351923798765477786913079296]
    INFO 15:19:54,751 Repair session manual-repair-89c1acb0-184c-438f-bab8-7ceed27980ec (on cfs [Ljava.lang.String;@3491507b, range (85070591730234615865843651857942052864,127605887595351923798765477786913079296]) completed successfully
    INFO 15:19:54,816 Sending AEService tree for #<TreeRequest manual-repair-6d2438ca-a05c-4217-92c7-c2ad563a92dd, /192.168.81.2, (A,UserDetails), (42535295865117307932921825928971026432,85070591730234615865843651857942052864]>
    INFO 15:19:54,865 Endpoints /192.168.81.2 and /192.168.81.4 are consistent for UserDetails on (42535295865117307932921825928971026432,85070591730234615865843651857942052864]
    INFO 15:19:54,874 Endpoints /192.168.81.2 and /192.168.81.5 are consistent for UserDetails on (42535295865117307932921825928971026432,85070591730234615865843651857942052864]
    INFO 15:19:54,874 Repair session manual-repair-6d2438ca-a05c-4217-92c7-c2ad563a92dd (on cfs [Ljava.lang.String;@7e541d08, range (42535295865117307932921825928971026432,85070591730234615865843651857942052864]) completed successfully
    INFO 15:19:54,909 Sending AEService tree for #<TreeRequest manual-repair-98d1a21c-9d6e-41c8-8917-aea70f716243, /192.168.81.2, (A,UserDetails), (127605887595351923798765477786913079296,0]>
    INFO 15:19:54,967 Endpoints /192.168.81.2 and /192.168.81.3 are consistent for UserDetails on (127605887595351923798765477786913079296,0]
    INFO 15:19:54,974 Endpoints /192.168.81.2 and /192.168.81.4 are consistent for UserDetails on (127605887595351923798765477786913079296,0]
    INFO 15:19:54,975 Repair session manual-repair-98d1a21c-9d6e-41c8-8917-aea70f716243 (on cfs [Ljava.lang.String;@48c651f2, range (127605887595351923798765477786913079296,0]) completed successfully
    INFO 15:19:54,975 Repair command #8 completed successfully

Since I'm using SimpleStrategy, I would expect the keys to be spread more or less evenly across the nodes, but that doesn't seem to be the case.

Has anyone come across similar behaviour? Does anyone have any suggestions as to what I could do to get some of the data onto node1? Obviously this data split means that node2, node3 and node4 have to do all the read work, which is far from ideal.

Any suggestions greatly appreciated.

Kind regards, Bart

SimpleStrategy means that Cassandra distributes your data without taking racks, data centers or any other notion of locality into account. That's important for understanding the data distribution, but on its own it isn't enough to fully analyze your situation.

How rows are distributed across the cluster is also a question of which partitioner you use. The RandomPartitioner hashes row keys before deciding which cluster members should own them. An order-preserving partitioner, on the other hand, can create hot spots on your cluster (including not using some nodes at all!), even when your nodes own equal ring ranges. You can experiment with how Cassandra assigns different keys by running the following command on one of your nodes, to see which node(s) Cassandra thinks a given key (real or hypothetical) belongs to:

 nodetool -h localhost getendpoints <keyspace> <cf> <key> 

If your other column families are distributing their data across the cluster correctly, I would take a close look at the partitioner and the keys you are using.
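For example, `nodetool -h localhost getendpoints A UserDetails <some-key>` prints the replicas for one key. To illustrate why the partitioner matters (a toy model, not Cassandra's actual code): with RandomPartitioner the keys are hashed and therefore scatter over the ring, while an order-preserving partitioner derives the token directly from the key bytes, so lexicographically close keys can all pile onto the same replicas:

    from hashlib import md5

    # A hypothetical, evenly spaced 4-node ring.
    ring_tokens = [0, 2**125, 2**126, 3 * 2**125]
    nodes = ['node1', 'node2', 'node3', 'node4']

    def owner(t):
        # The first node whose token is >= t, wrapping around the ring.
        for tok, node in zip(ring_tokens, nodes):
            if tok >= t:
                return node
        return nodes[0]

    def random_token(key):
        # RandomPartitioner-style: hash first, so keys scatter over the ring.
        return int.from_bytes(md5(key.encode()).digest(), 'big') % 2**127

    def ordered_token(key):
        # Order-preserving style: token taken from the raw key bytes, so
        # similar keys get adjacent tokens.
        return int.from_bytes(key.encode().ljust(16, b'\0'), 'big') % 2**127

    for key in ['user0001', 'user0002', 'user0003']:
        print(key, owner(random_token(key)), owner(ordered_token(key)))
    # The ordered tokens put all three keys on the same node: a hot spot.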

Turns out it was a problem with the schema – instead of having many rows (one row per user), we had one huge row with more than 800,000 columns.

What I suspect happened is:

  • the row was held in the OS page cache the whole time – hence we never saw any IO
  • Cassandra then spent all of its CPU time repeatedly serializing the huge row in order to get the data out

We changed the way the application does this, i.e. it now stores one row per user's details, and the problem has gone away.
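For comparison, a minimal sketch of the before/after data model (assuming a pycassa-style client, the usual Python client for the Cassandra 0.8 era; the connection details, row keys and values here are made up):

    import pycassa

    pool = pycassa.ConnectionPool('A', ['192.168.81.2:9160'])
    cf = pycassa.ColumnFamily(pool, 'UserDetails')  # comparator is IntegerType

    user_id = 42
    details = b'serialized user details'

    # Before: one huge row under a constant key, one column per user.
    # A single row belongs to a single replica set, so only 3 of the 4
    # nodes (RF = 3) ever store this ColumnFamily, and every read has to
    # work through the same 800,000-column row.
    cf.insert('all_users', {user_id: details})

    # After: one row per user. Each row key hashes to its own ring
    # position, so the rows (and the read load) spread across all nodes.
    cf.insert('user:%d' % user_id, {1: details})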