Polars中的半连接与Postgresql的半连接
最近在了解Polars库的Join操作时, 看到一个semi-join 的概念就是半连接. 所谓半连接是指当一张表在另一张表找到匹配的记录后, 仅返回第一张表中的记录, 与条件连接不同的是, 就算在右表中找到了若干条匹配的记录, 也仅仅会返回左表中的一条记录, 而右表的记录不会返回! 换句话说半连接通常是in 或者exists这样的查询, 通常结构如下:
select * from outer_tables where exists(select 1 from inner_table where outer_tables.xxx=inner_table.xxx) and ...
或者
select * from outer_tables where expr in (select ... from inner_table ...) and ...
这种查询的特点是我们只关心outer_table中与inner_table中相关字段相匹配的记录。换句话说,最后的结果集是在outer_tables中的,而semi-join的作用只是对outer_tables中的记录进行筛选。这也是我们进行半连接优化的基础,即通过半连接从inner_table中获取到最少量的足以对outer_tables记录进行筛选的信息就足够了。
下面用PostgreSQL中的递归来模拟半连接
create table a(id int, info text, ts timestamptz);
--用于创建一个跟表a具有相同表结构的表
create table b as select * from a where 1=0 --另一个优雅点的写法是: create table b(like a);
--构造大量模拟数据插入表a
insert into a
select id,md5(random()::text) as info,now() as ts from generate_series(0, 999999) as t(id);
insert into b
select random()*10 as id ,md5(random()::text) as info,now() as ts from generate_series(0, 999999) as t(id);
--表a跟表b分别创建索引
create index on a(id);
create index on b(id);
注意: random()函数用于返回一个0到1的随机浮点数, 所以b中的1000000条记录中含有大量的重复id, 对于这种情况, 我们要查询在b中也存在的a的记录, 就可以用半连接加速
先看下普通的exists查询的效率(在我本机上的结果):
select * from a where exists(select 1 from b where a.id=b.id)
explain anlyze select * from a where exists(select 1 from b where a.id=b.id)
查询计划结果:
Merge Join (cost=20901.17..20901.71 rows=11 width=45) (actual time=226.269..226.278 rows=11 loops=1)
Merge Cond: (a.id = b.id)
-> Index Scan using a_id_idx on a (cost=0.42..35329.44 rows=1000000 width=45) (actual time=0.013..0.017 rows=12 loops=1)
-> Sort (cost=20900.74..20900.77 rows=11 width=4) (actual time=226.251..226.252 rows=11 loops=1)
Sort Key: b.id
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=20900.44..20900.55 rows=11 width=4) (actual time=226.191..226.192 rows=11 loops=1)
Group Key: b.id
Batches: 1 Memory Usage: 24kB
-> Index Only Scan using b_id_idx on b (cost=0.42..18400.44 rows=1000000 width=4) (actual time=0.023..88.011 rows=1000000 loops=1)
Heap Fetches: 0
Planning Time: 0.984 ms
Execution Time: 226.501 ms
结果不算太好, 现在用PostgreSQL中的递归查询间接实现半连接的效果:
select * from a where a.id in(
with recursive tmp as(
select min(id) as id from b
union all
select (select min(b.id) from b where b.id>tmp.id) from tmp where tmp.id is not null
)select id from tmp where tmp.id is not null
)
查看下查询计划:
explain analyze
select * from a where a.id in(
with recursive tmp as(
select min(id) as id from b
union all
select (select min(b.id) from b where b.id>tmp.id) from tmp where tmp.id is not null
)select id from tmp where tmp.id is not null
)
查询计划结果:
Nested Loop (cost=54.01..895.84 rows=100 width=45) (actual time=0.223..0.244 rows=11 loops=1)
-> HashAggregate (cost=53.59..54.59 rows=100 width=4) (actual time=0.170..0.174 rows=11 loops=1)
Group Key: tmp.id
Batches: 1 Memory Usage: 24kB
-> CTE Scan on tmp (cost=50.32..52.34 rows=100 width=4) (actual time=0.058..0.165 rows=11 loops=1)
Filter: (id IS NOT NULL)
Rows Removed by Filter: 1
CTE tmp
-> Recursive Union (cost=0.45..50.32 rows=101 width=4) (actual time=0.056..0.160 rows=12 loops=1)
-> Result (cost=0.45..0.46 rows=1 width=4) (actual time=0.055..0.056 rows=1 loops=1)
InitPlan 3 (returns $1)
-> Limit (cost=0.42..0.45 rows=1 width=4) (actual time=0.052..0.053 rows=1 loops=1)
-> Index Only Scan using b_id_idx on b b_1 (cost=0.42..20900.44 rows=1000000 width=4) (actual time=0.051..0.052 rows=1 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
-> WorkTable Scan on tmp tmp_1 (cost=0.00..4.78 rows=10 width=4) (actual time=0.008..0.008 rows=1 loops=12)
Filter: (id IS NOT NULL)
Rows Removed by Filter: 0
SubPlan 2
-> Result (cost=0.45..0.46 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=11)
InitPlan 1 (returns $3)
-> Limit (cost=0.42..0.45 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=11)
-> Index Only Scan using b_id_idx on b (cost=0.42..7803.11 rows=333334 width=4) (actual time=0.007..0.007 rows=1 loops=11)
Index Cond: ((id IS NOT NULL) AND (id > tmp_1.id))
Heap Fetches: 0
-> Index Scan using a_id_idx on a (cost=0.42..8.40 rows=1 width=45) (actual time=0.006..0.006 rows=1 loops=11)
Index Cond: (id = tmp.id)
Planning Time: 0.630 ms
Execution Time: 0.487 ms
效果非常明显了: 226.501 ms VS 0.487 ms
这就是半连接的威力!
回到最开始Polars库中的join, 要实现半连接非常简单, 给join函数的how关键字参数指定为'semi' 就可以实现高效的半连接:
import polars as pl
result = a.join(b, on='id', how='semi')
这里面on关键字用于指定连接的字段, how='semi'用于表示是半连接