灯火互联
管理员
管理员
  • 注册日期2011-07-27
  • 发帖数41778
  • QQ
  • 火币41290枚
  • 粉丝1086
  • 关注100
  • 终身成就奖
  • 最爱沙发
  • 忠实会员
  • 灌水天才奖
  • 贴图大师奖
  • 原创先锋奖
  • 特殊贡献奖
  • 宣传大使奖
  • 优秀斑竹奖
  • 社区明星
阅读:2922回复:0

[mysql]千万级别mysql合并表快速去重简析

楼主#
更多 发布于:2012-09-01 02:19

千万级别mysql合并表快速去重简析

mysql合并表去重
目标:
现有表a和b,把两个表中的数据合并去重到c表中。其中a和b表中数据量大概在2千万左右。
基本情况
操作系统版本:CentOS release 5.6 64位
操作系统内存:8G
数据库版本:5.1.56-community 64位
数据库初始化参数:默认

数据库表和数据量
表a: www.atcpu.com
mysql> desc a2kw;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| c1    | varchar(20) | YES  | MUL | NULL    |       |
| c2    | varchar(30) | YES  |     | NULL    |       |
| c3    | varchar(12) | YES  |     | NULL    |       |
| c4    | varchar(20) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
表b
mysql> desc b2kw;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| c1    | varchar(20) | YES  |     | NULL    |       |
| c2    | varchar(30) | YES  |     | NULL    |       |
| c3    | varchar(12) | YES  |     | NULL    |       |
| c4    | varchar(20) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

a和b表的数据概况如下
mysql> select * from a2kw limit 10;
+-----------+-----------+------+----------+
| c1        | c2        | c3   | c4       |
+-----------+-----------+------+----------+
| 662164461 | 131545534 | TOM0 | 20120520 |
| 226662142 | 605685564 | TOM0 | 20120516 |
| 527008225 | 172557633 | TOM0 | 20120514 |
| 574408183 | 350897450 | TOM0 | 20120510 |
| 781619324 | 583989494 | TOM0 | 20120510 |
| 158872754 | 775676430 | TOM0 | 20120512 |
| 815875622 | 631631832 | TOM0 | 20120514 |
| 905943640 | 477433083 | TOM0 | 20120514 |
| 660790641 | 616774715 | TOM0 | 20120512 |
| 999083595 | 953186525 | TOM0 | 20120513 |
+-----------+-----------+------+----------+
10 rows in set (0.01 sec)

基本步骤 www.atcpu.com
1、在B表上创建索引
mysql> select count(*) from b2kw;
+----------+
| count(*) |
+----------+
| 20000002 |
+----------+
1 row in set (0.00 sec)
mysql> create index ind_b2kw_c1 on  b2kw(c1);
Query OK, 20000002 rows affected (1 min 2.94 sec)
Records: 20000002  Duplicates: 0  Warnings: 0
数据量为:20000002 ,时间为:1 min 2.94 sec
2、把a、b分别插入中间表temp表中

创建中间表
mysql> create table temp  select * from c2kw where 1=2;
Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0
插入数据
mysql> insert into temp  select * from a2kw;
Query OK, 20000002 rows affected (13.23 sec)
Records: 20000002  Duplicates: 0  Warnings: 0
mysql> insert into temp  select * from b2kw;
Query OK, 20000002 rows affected (13.27 sec)
Records: 20000002  Duplicates: 0  Warnings: 0
  www.atcpu.com
mysql> select count(*) from temp;
+----------+
| count(*) |
+----------+
| 40000004 |
+----------+
1 row in set (0.00 sec)
数据量为:40000004 ,时间为:26.50 sec
3、temp建立联合索引,强制索引去掉重复数据
mysql> create index ind_temp_c123 on temp(c1,c2,c3);
Query OK, 40000004 rows affected (3 min 43.87 sec)
Records: 40000004  Duplicates: 0  Warnings: 0
查看执行计划
mysql> explain select c1,c2,c3,max(c4) from temp FORCE INDEX
(ind_temp_c123) group by c1,c2,c3 ;
+----+-------------+-------+-------+---------------+----------
-----+---------+------+----------+-------+
| id | select_type | table | type  | possible_keys | key      
    | key_len | ref  | rows     | Extra |
+----+-------------+-------+-------+---------------+-------------
--+---------+------+----------+-------+
|  1 | SIMPLE      | temp  | index | NULL          | ind_temp_c123 | 71  
   | NULL | 40000004 |       |
+----+-------------+-------+ www.atcpu.com-------+---------------+--------
-------+---------+------+----------+-------+
1 row in set (0.05 sec)

mysql> insert into c2kw select c1,c2,c3,max(c4) from temp
FORCE INDEX (ind_temp_c123) group by c1,c2,c3 ;
Query OK, 20000004 rows affected (2 min 0.85 sec)
Records: 20000004  Duplicates: 0  Warnings: 0
实际大约花费实际为:6 min

4、删除中间表
mysql> drop table temp;
Query OK, 0 rows affected (0.99 sec)
实际大约花费实际为:1 sec

5、建立c索引
mysql> create index ind_c2kw_c1 on c2kw(c1);
Query OK, 20000004 rows affected (49.74 sec)
Records: 20000004  Duplicates: 0  Warnings: 0
mysql> create index ind_c2kw_c2 on c2kw(c2);
Query OK, 20000004 rows affected (1 min 47.20 sec)
Records: 20000004  Duplicates: 0  Warnings: 0
mysql> create index ind_c2kw_c3 on c2kw(c3);
Query OK, 20000004 rows affected (2 min 42.02 sec)
Records: 20000004  Duplicates: 0  Warnings: 0
实际大约花费实际为:5分钟 www.atcpu.com

6、清空a、b表
mysql> truncate table a2kw;
Query OK, 0 rows affected (1.15 sec)
mysql> truncate table b2kw;
Query OK, 0 rows affected (1.34 sec)
实际大约花费实际为:3sec

一共花费的时间大概在15


喜欢0 评分0
游客

返回顶部