最佳体验请使用Chrome67及以上版本、火狐、Edge、Safari浏览器 ×

创建银行
创建开票

    Polars中的半连接与Postgresql的半连接

    作者:杨博@勾股弦数据 阅读69 2024/09/18 01:26:28 文章 原创 公开

    最近在了解Polars库的Join操作时, 看到一个semi-join 的概念就是半连接. 所谓半连接是指当一张表在另一张表找到匹配的记录后, 仅返回第一张表中的记录, 与条件连接不同的是, 就算在右表中找到了若干条匹配的记录, 也仅仅会返回左表中的一条记录, 而右表的记录不会返回! 换句话说半连接通常是in 或者exists这样的查询, 通常结构如下:

    select * from outer_tables where exists(select 1 from inner_table where outer_tables.xxx=inner_table.xxx) and ...

    或者

    select * from outer_tables where expr in (select ... from inner_table ...) and ...


    这种查询的特点是我们只关心outer_table中与inner_table中相关字段相匹配的记录。换句话说,最后的结果集是在outer_tables中的,而semi-join的作用只是对outer_tables中的记录进行筛选。这也是我们进行半连接优化的基础,即通过半连接从inner_table中获取到最少量的足以对outer_tables记录进行筛选的信息就足够了。


    下面用PostgreSQL中的递归来模拟半连接


    create table a(id int,  info text, ts timestamptz);


    --用于创建一个跟表a具有相同表结构的表

    create table b as select * from a where 1=0  --另一个优雅点的写法是: create table b(like a); 


    --构造大量模拟数据插入表a

    insert into a

    select id,md5(random()::text) as info,now() as ts from  generate_series(0, 999999) as t(id);


    insert into b

    select random()*10 as id ,md5(random()::text) as info,now() as ts from  generate_series(0, 999999) as t(id);


    --表a跟表b分别创建索引

    create index on a(id);

    create index on b(id);


    注意: random()函数用于返回一个0到1的随机浮点数, 所以b中的1000000条记录中含有大量的重复id, 对于这种情况, 我们要查询在b中也存在的a的记录, 就可以用半连接加速


    先看下普通的exists查询的效率(在我本机上的结果):

    select * from a where exists(select 1 from b where a.id=b.id)

    explain anlyze select * from a where exists(select 1 from b where a.id=b.id)

    查询计划结果:

    Merge Join  (cost=20901.17..20901.71 rows=11 width=45) (actual time=226.269..226.278 rows=11 loops=1)

      Merge Cond: (a.id = b.id)

      ->  Index Scan using a_id_idx on a  (cost=0.42..35329.44 rows=1000000 width=45) (actual time=0.013..0.017 rows=12 loops=1)

      ->  Sort  (cost=20900.74..20900.77 rows=11 width=4) (actual time=226.251..226.252 rows=11 loops=1)

            Sort Key: b.id

            Sort Method: quicksort  Memory: 25kB

            ->  HashAggregate  (cost=20900.44..20900.55 rows=11 width=4) (actual time=226.191..226.192 rows=11 loops=1)

                  Group Key: b.id

                  Batches: 1  Memory Usage: 24kB

                  ->  Index Only Scan using b_id_idx on b  (cost=0.42..18400.44 rows=1000000 width=4) (actual time=0.023..88.011 rows=1000000 loops=1)

                        Heap Fetches: 0

    Planning Time: 0.984 ms

    Execution Time: 226.501 ms


    结果不算太好, 现在用PostgreSQL中的递归查询间接实现半连接的效果:

    select * from a where a.id in(

            with recursive tmp as(

                 select min(id) as id from b

                 union all

                 select (select min(b.id) from b where b.id>tmp.id) from  tmp where tmp.id is not null

            )select id from tmp where tmp.id is not null

    )


    查看下查询计划:

    explain analyze

        select * from a where a.id in(

        with recursive tmp as(

             select min(id) as id from b

             union all

             select (select min(b.id) from b where b.id>tmp.id) from  tmp where tmp.id is not null

        )select id from tmp where tmp.id is not null

    )


    查询计划结果:

    Nested Loop  (cost=54.01..895.84 rows=100 width=45) (actual time=0.223..0.244 rows=11 loops=1)

      ->  HashAggregate  (cost=53.59..54.59 rows=100 width=4) (actual time=0.170..0.174 rows=11 loops=1)

            Group Key: tmp.id

            Batches: 1  Memory Usage: 24kB

            ->  CTE Scan on tmp  (cost=50.32..52.34 rows=100 width=4) (actual time=0.058..0.165 rows=11 loops=1)

                  Filter: (id IS NOT NULL)

                  Rows Removed by Filter: 1

                  CTE tmp

                    ->  Recursive Union  (cost=0.45..50.32 rows=101 width=4) (actual time=0.056..0.160 rows=12 loops=1)

                          ->  Result  (cost=0.45..0.46 rows=1 width=4) (actual time=0.055..0.056 rows=1 loops=1)

                                InitPlan 3 (returns $1)

                                  ->  Limit  (cost=0.42..0.45 rows=1 width=4) (actual time=0.052..0.053 rows=1 loops=1)

                                        ->  Index Only Scan using b_id_idx on b b_1  (cost=0.42..20900.44 rows=1000000 width=4) (actual time=0.051..0.052 rows=1 loops=1)

                                              Index Cond: (id IS NOT NULL)

                                              Heap Fetches: 0

                          ->  WorkTable Scan on tmp tmp_1  (cost=0.00..4.78 rows=10 width=4) (actual time=0.008..0.008 rows=1 loops=12)

                                Filter: (id IS NOT NULL)

                                Rows Removed by Filter: 0

                                SubPlan 2

                                  ->  Result  (cost=0.45..0.46 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=11)

                                        InitPlan 1 (returns $3)

                                          ->  Limit  (cost=0.42..0.45 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=11)

                                                ->  Index Only Scan using b_id_idx on b  (cost=0.42..7803.11 rows=333334 width=4) (actual time=0.007..0.007 rows=1 loops=11)

                                                      Index Cond: ((id IS NOT NULL) AND (id > tmp_1.id))

                                                      Heap Fetches: 0

      ->  Index Scan using a_id_idx on a  (cost=0.42..8.40 rows=1 width=45) (actual time=0.006..0.006 rows=1 loops=11)

            Index Cond: (id = tmp.id)

    Planning Time: 0.630 ms

    Execution Time: 0.487 ms


    效果非常明显了: 226.501 ms VS 0.487 ms

    这就是半连接的威力!


    回到最开始Polars库中的join, 要实现半连接非常简单, 给join函数的how关键字参数指定为'semi' 就可以实现高效的半连接:

    import polars as pl

    result = a.join(b, on='id', how='semi')

    这里面on关键字用于指定连接的字段, how='semi'用于表示是半连接









    声明:本网站部分内容来源于网络,版权归原权利人所有,其观点不代表本网站立场;本网站视频或图片制作权归当前商户及其作者,涉及未经授权的制作均须标记“样稿”。如内容侵犯了您相关权利,请及时通过邮箱service@ichub.com与我们联系。
     0  0

    微信扫一扫:分享

    微信里点“+”,扫一扫二维码

    便可将本文分享至朋友圈。

      
    
    
    分享
     0
      验证