Polars中的半连接与Postgresql的半连接

最近在了解Polars库的Join操作时, 看到一个semi-join 的概念就是半连接. 所谓半连接是指当一张表在另一张表找到匹配的记录后, 仅返回第一张表中的记录, 与条件连接不同的是, 就算在右表中找到了若干条匹配的记录, 也仅仅会返回左表中的一条记录, 而右表的记录不会返回! 换句话说半连接通常是in 或者exists这样的查询, 通常结构如下:

select * from outer_tables where exists(select 1 from inner_table where outer_tables.xxx=inner_table.xxx) and ...

或者

select * from outer_tables where expr in (select ... from inner_table ...) and ...

这种查询的特点是我们只关心outer_table中与inner_table中相关字段相匹配的记录。换句话说，最后的结果集是在outer_tables中的，而semi-join的作用只是对outer_tables中的记录进行筛选。这也是我们进行半连接优化的基础，即通过半连接从inner_table中获取到最少量的足以对outer_tables记录进行筛选的信息就足够了。

下面用PostgreSQL中的递归来模拟半连接

create table a(id int, info text, ts timestamptz);

--用于创建一个跟表a具有相同表结构的表

create table b as select * from a where 1=0 --另一个优雅点的写法是: create table b(like a);

--构造大量模拟数据插入表a

insert into a

select id,md5(random()::text) as info,now() as ts from generate_series(0, 999999) as t(id);

insert into b

select random()*10 as id ,md5(random()::text) as info,now() as ts from generate_series(0, 999999) as t(id);

--表a跟表b分别创建索引

create index on a(id);

create index on b(id);

注意: random()函数用于返回一个0到1的随机浮点数, 所以b中的1000000条记录中含有大量的重复id, 对于这种情况, 我们要查询在b中也存在的a的记录, 就可以用半连接加速

先看下普通的exists查询的效率(在我本机上的结果):

select * from a where exists(select 1 from b where a.id=b.id)

explain anlyze select * from a where exists(select 1 from b where a.id=b.id)

查询计划结果:

Merge Join (cost=20901.17..20901.71 rows=11 width=45) (actual time=226.269..226.278 rows=11 loops=1)

Merge Cond: (a.id = b.id)

-> Index Scan using a_id_idx on a (cost=0.42..35329.44 rows=1000000 width=45) (actual time=0.013..0.017 rows=12 loops=1)

-> Sort (cost=20900.74..20900.77 rows=11 width=4) (actual time=226.251..226.252 rows=11 loops=1)

Sort Key: b.id

Sort Method: quicksort Memory: 25kB

-> HashAggregate (cost=20900.44..20900.55 rows=11 width=4) (actual time=226.191..226.192 rows=11 loops=1)

Group Key: b.id

Batches: 1 Memory Usage: 24kB

-> Index Only Scan using b_id_idx on b (cost=0.42..18400.44 rows=1000000 width=4) (actual time=0.023..88.011 rows=1000000 loops=1)

Heap Fetches: 0

Planning Time: 0.984 ms

Execution Time: 226.501 ms

结果不算太好, 现在用PostgreSQL中的递归查询间接实现半连接的效果:

select * from a where a.id in(

with recursive tmp as(

select min(id) as id from b

union all

select (select min(b.id) from b where b.id>tmp.id) from tmp where tmp.id is not null

)select id from tmp where tmp.id is not null

)

查看下查询计划:

explain analyze

select * from a where a.id in(

with recursive tmp as(

select min(id) as id from b

union all

select (select min(b.id) from b where b.id>tmp.id) from tmp where tmp.id is not null

)select id from tmp where tmp.id is not null

)

查询计划结果:

Nested Loop (cost=54.01..895.84 rows=100 width=45) (actual time=0.223..0.244 rows=11 loops=1)

-> HashAggregate (cost=53.59..54.59 rows=100 width=4) (actual time=0.170..0.174 rows=11 loops=1)

Group Key: tmp.id

Batches: 1 Memory Usage: 24kB

-> CTE Scan on tmp (cost=50.32..52.34 rows=100 width=4) (actual time=0.058..0.165 rows=11 loops=1)

Filter: (id IS NOT NULL)

Rows Removed by Filter: 1

CTE tmp

-> Recursive Union (cost=0.45..50.32 rows=101 width=4) (actual time=0.056..0.160 rows=12 loops=1)

-> Result (cost=0.45..0.46 rows=1 width=4) (actual time=0.055..0.056 rows=1 loops=1)

InitPlan 3 (returns $1)

-> Limit (cost=0.42..0.45 rows=1 width=4) (actual time=0.052..0.053 rows=1 loops=1)

-> Index Only Scan using b_id_idx on b b_1 (cost=0.42..20900.44 rows=1000000 width=4) (actual time=0.051..0.052 rows=1 loops=1)

Index Cond: (id IS NOT NULL)

Heap Fetches: 0

-> WorkTable Scan on tmp tmp_1 (cost=0.00..4.78 rows=10 width=4) (actual time=0.008..0.008 rows=1 loops=12)

Filter: (id IS NOT NULL)

Rows Removed by Filter: 0

SubPlan 2

-> Result (cost=0.45..0.46 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=11)

InitPlan 1 (returns $3)

-> Limit (cost=0.42..0.45 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=11)

-> Index Only Scan using b_id_idx on b (cost=0.42..7803.11 rows=333334 width=4) (actual time=0.007..0.007 rows=1 loops=11)

Index Cond: ((id IS NOT NULL) AND (id > tmp_1.id))

Heap Fetches: 0

-> Index Scan using a_id_idx on a (cost=0.42..8.40 rows=1 width=45) (actual time=0.006..0.006 rows=1 loops=11)

Index Cond: (id = tmp.id)

Planning Time: 0.630 ms

Execution Time: 0.487 ms

效果非常明显了: 226.501 ms VS 0.487 ms

这就是半连接的威力!

回到最开始Polars库中的join, 要实现半连接非常简单, 给join函数的how关键字参数指定为'semi' 就可以实现高效的半连接:

import polars as pl

result = a.join(b, on='id', how='semi')

这里面on关键字用于指定连接的字段, how='semi'用于表示是半连接