今天看了一下PostgreSQL row_number的实现过程。之前一直好奇窗口函数是什么,原理是什么,今天稍稍解惑。下面就以row_number为例进行介绍: 窗口函数:
窗口函数在一组表行中执行计算,这些表行以某种方式与当前行相关。 这与使用聚合函数可以完成的计算类型相当。 但是,窗口函数不会导致行被分组到单个输出行,就像非窗口聚合调用一样。 相反,行保留其独立的身份。 在幕后,窗口功能不仅可以访问查询结果的当前行。
row_number使用示例:
[postgres@shawnpc bin]$ ./psql
psql (13devel)
Type "help" for help.
postgres=# select row_number() over() as rownum, id from aa;
rownum | id
--------+----
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 6
7 | 7
8 | 8
9 | 9
10 | 10
(10 rows)
postgres=#
row_number代码:
/*
* row_number
* just increment up from 1 until current partition finishes.
*/
Datum
window_row_number(PG_FUNCTION_ARGS)
{
WindowObject winobj = PG_WINDOW_OBJECT(); //获取窗口函数内存上下文
int64 curpos = WinGetCurrentPosition(winobj); //初始化位置
WinSetMarkPosition(winobj, curpos); //将行号和位置绑定
PG_RETURN_INT64(curpos + 1); //返回行号
}
看起来似乎非常简单,但是经过调试发现这里和执行计划耦合度很高: 设置函数断点:
Breakpoint 1, window_row_number (fcinfo=0x7ffc158cce90) at windowfuncs.c:83
83 {
(gdb) bt
#0 window_row_number (fcinfo=0x7ffc158cce90) at windowfuncs.c:83
#1 0x0000000000632956 in eval_windowfunction (perfuncstate=0x1ca3768, result=0x1ca3738, isnull=0x1ca3750, winstate=0x1ca23e8,
winstate=0x1ca23e8) at nodeWindowAgg.c:1056
#2 0x0000000000635174 in ExecWindowAgg (pstate=0x1ca23e8) at nodeWindowAgg.c:2198
#3 0x0000000000605b82 in ExecProcNode (node=0x1ca23e8) at ../../../src/include/executor/executor.h:240
#4 ExecutePlan (execute_once=<optimized out>, dest=0x1c125e8, direction=<optimized out>, numberTuples=0, sendTuples=true,
operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1ca23e8, estate=0x1ca21c0) at execMain.c:1648
#5 standard_ExecutorRun (queryDesc=0x1c0eb70, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:365
#6 0x000000000074c81b in PortalRunSelect (portal=portal@entry=0x1c52e90, forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x1c125e8) at pquery.c:929
#7 0x000000000074db60 in PortalRun (portal=portal@entry=0x1c52e90, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x1c125e8, altdest=altdest@entry=0x1c125e8,
completionTag=completionTag@entry=0x7ffc158cd7e0 "") at pquery.c:770
#8 0x0000000000749bc6 in exec_simple_query (query_string=0x1becfa0 "select row_number() over() as rownum, id from aa;") at postgres.c:1231
#9 0x000000000074aea2 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1c16f70, dbname=0x1c16e98 "postgres", username=<optimized out>)
at postgres.c:4256
#10 0x000000000047e579 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4446
#11 BackendStartup (port=0x1c0ee70) at postmaster.c:4137
#12 ServerLoop () at postmaster.c:1704
#13 0x00000000006ddb9d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1be7bb0) at postmaster.c:1377
#14 0x000000000047f243 in main (argc=3, argv=0x1be7bb0) at main.c:210
从上可知,首先row_number函数执行是在执行计划执行之后进行调用的。 首先进入的是ExecutePlan:
static void
ExecutePlan(EState *estate,
PlanState *planstate,
bool use_parallel_mode,
CmdType operation,
bool sendTuples,
uint64 numberTuples,
ScanDirection direction,
DestReceiver *dest,
bool execute_once)
{
TupleTableSlot *slot;
uint64 current_tuple_count;
略
for (;;)
{
/* Reset the per-output-tuple exprcontext */
ResetPerTupleExprContext(estate);
/*
* Execute the plan and obtain a tuple
*/
slot = ExecProcNode(planstate);
略
}
这里调用了ExecProcNode(宏定义,调用了ExecWindowAgg),ExecWindowAgg调用了eval_windowfunction,而正是eval_windowfunction完成了row_number的调用,并且构建了相关数据。通过调试可以发现,多少行数据就会调用多少次row_number。
eval_windowfunction:
/*
* eval_windowfunction
*
* Arguments of window functions are not evaluated here, because a window
* function can need random access to arbitrary rows in the partition.
* The window function uses the special WinGetFuncArgInPartition and
* WinGetFuncArgInFrame functions to evaluate the arguments for the rows
* it wants.
*/
static void
eval_windowfunction(WindowAggState *winstate, WindowStatePerFunc perfuncstate,
Datum *result, bool *isnull)
{
LOCAL_FCINFO(fcinfo, FUNC_MAX_ARGS);
MemoryContext oldContext;
oldContext = MemoryContextSwitchTo(winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory); //切换至tuple的内存上下文
/*
* We don't pass any normal arguments to a window function, but we do pass
* it the number of arguments, in order to permit window function
* implementations to support varying numbers of arguments. The real info
* goes through the WindowObject, which is passed via fcinfo->context.
*/
InitFunctionCallInfoData(*fcinfo, &(perfuncstate->flinfo),
perfuncstate->numArguments,
perfuncstate->winCollation,
(void *) perfuncstate->winobj, NULL);//初始化fcinfo,为下面调用函数使用
/* Just in case, make all the regular argument slots be null */
for (int argno = 0; argno < perfuncstate->numArguments; argno++)
fcinfo->args[argno].isnull = true;//见注释
/* Window functions don't have a current aggregate context, either */
winstate->curaggcontext = NULL;//见注释
*result = FunctionCallInvoke(fcinfo);//调用函数
*isnull = fcinfo->isnull;
/*
* Make sure pass-by-ref data is allocated in the appropriate context. (We
* need this in case the function returns a pointer into some short-lived
* tuple, as is entirely possible.)
*/
if (!perfuncstate->resulttypeByVal && !fcinfo->isnull &&
!MemoryContextContains(CurrentMemoryContext,
DatumGetPointer(*result)))
*result = datumCopy(*result,
perfuncstate->resulttypeByVal,
perfuncstate->resulttypeLen);
//见注释
MemoryContextSwitchTo(oldContext); //切换回原上下文
}
至此分析结束。