sql - Create index to speedup join on (1) ids match and (2) dates within a range -
what index should create table b optimize join on (1) ids matching , (2) date falls within range.
update: issue have hundreds of thousands of dates 25 portfolios... condition (2) work in cutting down size of resulting join...
more specifically, imagine have table a:
portfolio id | begin_date | end_date --------------------------------------------- 1 | 20150101 | 20150130 etc...
and table b:
portfolio_id | date | daily_ret ----------------------------------------------- 1 20150102 .00001 1 20150103 -.01023 etc...
the query not totally suck is:
select * tablea left join tableb b on (a.portfolio_id = b.portfolio_id) , (a.begin_date < b.date) , (b.date <= a.end_date)
i think want kinda index like: create index tableb_idx on tableb (portfolio_id, date)
but i'm total n00b sql indexes... imagine there's issue (a.begin_date < b.date) , (b.date <= a.end_date)
part?
update more specific info:
column | type | modifiers ------------+------------------+----------- yyyymmdd | integer | not null size_index | integer | not null beme_index | integer | not null ret | double precision | indexes: "portfolio_25_size_beme_pkey" primary key, btree (yyyymmdd, size_index, beme_index)
exact query:
explain analyze select exp(sum(log(1 + f_sbm.ret))) - 1 size_beme_ret, count(*) t, permno, period_start, period_end eb2.deleteme x join fama_french.portfolio_25_size_beme f_sbm on f_sbm.size_index = x.me_idx5 , f_sbm.beme_index = x.beme_idx5 , x.ps < f_sbm.yyyymmdd , f_sbm.yyyymmdd <= x.pe group permno, period_start, period_end
results (note limited eb2.deleteme 1000 rows because 180000 wouldn't finish...)
hashaggregate (cost=83599.59..83602.59 rows=200 width=24) (actual time=11209.030..11209.455 rows=949 loops=1) group key: x.permno, x.period_start, x.period_end -> merge join (cost=78736.24..83566.19 rows=1909 width=24) (actual time=1183.677..11184.172 rows=19492 loops=1) merge cond: ((f_sbm.size_index = x.me_idx5) , (f_sbm.beme_index = x.beme_idx5)) join filter: ((x.ps < f_sbm.yyyymmdd) , (f_sbm.yyyymmdd <= x.pe)) rows removed join filter: 22659494 -> sort (cost=78655.60..80136.72 rows=592450 width=20) (actual time=1182.834..1647.155 rows=545055 loops=1) sort key: f_sbm.size_index, f_sbm.beme_index sort method: external merge disk: 17328kb -> seq scan on portfolio_25_size_beme f_sbm (cost=0.00..9698.50 rows=592450 width=20) (actual time=0.014..155.770 rows=592450 loops=1) -> sort (cost=80.64..83.54 rows=1160 width=32) (actual time=0.797..1618.708 rows=22678987 loops=1) sort key: x.me_idx5, x.beme_idx5 sort method: quicksort memory: 102kb -> seq scan on deleteme x (cost=0.00..21.60 rows=1160 width=32) (actual time=0.020..0.357 rows=1000 loops=1) planning time: 0.574 ms execution time: 11228.745 ms
maybe hack, forcing join on date first pruning portfolio mismatches later runs 11x faster...
with stage1 as( select * eb2.deleteme x join fama_french.portfolio_25_size_beme f_sbm on x.ps < f_sbm.yyyymmdd , f_sbm.yyyymmdd <= x.pe ) select exp(sum(log(1+ret))) - 1 size_beme_ret, count(*) t, permno, period_start, period_end stage1 f_sbm size_index = me_idx5 , beme_index = beme_idx5 group permno, period_start, period_end
always create indexes intended being compared through equality first, , through greater/lower (or between) last (and final one, since including further columns not assist query of nature).
adding following index speed-up query considerably:
create index on fama_french.portfolio_25_size_beme (size_index, beme_index, yyyymmdd);
Comments
Post a Comment