Category Archives: research

How to build this nested JSON

This post is in response to multiple requests from the PG Conf attendees to provide a “schematic” example of how to build a record set and convert it to a JSON object (basically, to illustrate our “Connecting Galaxies” talk).

Actually, if you click on the link above and download the presentation, you will see an extra slide at the end, which provides the link to the code below… but we know that people do not like “clicking the links” :)

So, here it is one more time!

-- Building complex JSON objects from multiple tables
-- Deriving aggregates with json_agg and n_array_agg
-- Wrapping an array (of rows) into JSON, converted to text for JDBC transfer

-- Test aggregation
create table aa (a_id text, a_name text);
insert into aa values ('1', 'Anna'), ('2', 'Bob');

-- Aggregation on the second level
create table bb (b_id text, ab_id text, b_num text, b_ph text);
insert into bb values ('101', '1', '101-101', '(800)-123');
insert into bb values ('1012', '1', '1012102', '(800)-1234'),
  ('201', '2', '201-201', '(800)-1345');

-- Add one more embedded array on the second level
create table cc (c_id text, ac_id text, c_st text, c_more text);
insert into cc values
  ('11', '1', 'stst', 'more-more'),
  ('12', '1', 'swsw', 'more-less');
insert into cc values
  ('21', '2', 'stst-2', 'more-mo2222re'),
  ('22', '2', 'sws222w', 'more-less22');

-- Add the third level; dd will be nested under bb

create table dd (d_id text, bd_id text, d_value text);
insert into dd values
  ('1101', '101', 'dv11'),
  ('1102', '101', 'dv12'),
  ('2101', '201', 'dv22');

-- A CTE might be helpful to avoid processing big top-level tables multiple times

create or replace
function array_agg_next (agg_sta anyarray, val anyelement, b boolean)
returns anyarray as
$$
declare
  out_array alias for $0;
begin
  if b then
    if agg_sta is null then
      out_array := array[val];
    else
      out_array := agg_sta || val;
    end if;
  else
    out_array := agg_sta;
  end if;
  return out_array;
end;
$$ LANGUAGE plpgsql;

drop function if exists array_agg_final (agg_sta anyarray);
create or replace
function array_agg_final (agg_sta anyarray)
returns anyarray as
$$
declare
  final_array alias for $0;
begin
  final_array := agg_sta;
  return final_array;
end;
$$ LANGUAGE plpgsql;

--======================================================
create or replace
function array_agg_next (agg_sta anyarray, val anyelement, b boolean)
returns anyarray as
$$
begin
  if b then
    if agg_sta is null
      then agg_sta := array[val];
      else agg_sta := agg_sta || val;
    end if;
  end if;
  return agg_sta;
end;
$$ LANGUAGE plpgsql;

drop aggregate if exists n_array_agg (anyelement, boolean);
CREATE AGGREGATE n_array_agg (anyelement, boolean) (
  sfunc = array_agg_next,
  -- FINALFUNC = array_agg_final,
  STYPE = anyarray
);

drop type if exists jaa, jbb, jcc, jdd cascade;
create type jdd as (dd_key text, dd_value text);
create type jcc as (cc_key text, cc_st text, more text);
create type jbb as (bb_key text, num text, ph text, "DD" jdd[]);
create type jaa as (p_key text, p_name text, "BB" jbb[], "CC" jcc[]);

-- Collect rows of a specific type into an array; depends on the output type
create or replace
function collect_items (s text)
returns jaa[]
LANGUAGE plpgsql as
$body$
declare
  result jaa[];
begin
  select array_agg(single_item) into result
  from
    (select
       row (
         a_id, a_name,
         n_array_agg( row(b_id, b_num, b_ph, d_agg)::jbb, b_id is not null),
         n_array_agg( row(c_id, c_st, c_more)::jcc, c_id is not null)
       )::jaa as single_item
     from
       (
        select a_id, a_name, b_id, b_num, b_ph, c_id, c_st, c_more,
               n_array_agg(row(d_id, d_value)::jdd, d_id is not null) as d_agg
        from
          (
           select a_id, a_name, b_id, b_num, b_ph,
                  null as c_id, null as c_st, null as c_more, d_id, d_value
           from aa join bb on a_id = ab_id
                   left join dd on b_id = bd_id
           UNION ALL
           select a_id, a_name, null as b_id, null as b_num, null as b_ph,
                  c_id, c_st, c_more, null as d_id, null as d_value
           from aa join cc on a_id = ac_id
          ) temp_table_3
        group by
          a_id, a_name, b_id, b_num, b_ph, c_id, c_st, c_more
       ) temp_table_2
     group by a_id, a_name) items;
  return result;
end;
$body$;

-- Just another output type
create or replace
function collect_subitems (s text)
returns jbb[]
LANGUAGE plpgsql as
$body$
declare
  result jbb[];
begin
  select array_agg(single_item) into result
  from
    (
     select row(
              b_id, b_num, b_ph,
              n_array_agg(row(d_id, d_value)::jdd, d_id is not null)
            )::jbb as single_item
     from bb left join dd on b_id = bd_id
     group by b_id, b_num, b_ph
    ) items;
  return result;
end;
$body$;

-- Wrap an array into a set of text strings containing the JSON representation of array elements
-- This function does not depend on any type
create or replace
function array_transport (all_items anyarray) returns setof text
RETURNS NULL ON NULL INPUT
LANGUAGE plpgsql as
$body$
declare
  item record;
begin
  foreach item in array all_items
  loop
    return next (to_json(item))::text;
  end loop;
end;
$body$;

select * from array_transport (collect_items('a'));

select * from array_transport (collect_subitems('a'));
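For readers who prefer to trace the logic outside SQL, here is a minimal Python sketch (mine, not part of the talk's code) of what the aggregation above does for the aa/bb pair: flat joined rows are grouped by the parent key, and the child rows are folded into a nested array before serializing to JSON. The field names mirror the jaa/jbb types; the data mirrors the sample inserts.

```python
import json

# Flat rows, as a join of aa and bb would return them:
# (a_id, a_name, b_id, b_num, b_ph)
rows = [
    ("1", "Anna", "101", "101-101", "(800)-123"),
    ("1", "Anna", "1012", "1012102", "(800)-1234"),
    ("2", "Bob", "201", "201-201", "(800)-1345"),
]

def nest(rows):
    # Group child rows under their parent, like n_array_agg over rows
    # grouped by a_id, a_name in the SQL version.
    parents = {}
    for a_id, a_name, b_id, b_num, b_ph in rows:
        p = parents.setdefault(a_id, {"p_key": a_id, "p_name": a_name, "BB": []})
        if b_id is not None:  # the boolean flag passed to n_array_agg
            p["BB"].append({"bb_key": b_id, "num": b_num, "ph": b_ph})
    return list(parents.values())

print(json.dumps(nest(rows)))
```

The SQL version does the same folding in one pass per grouping level, which is exactly why a single round trip can return the whole nested structure.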


Filed under research, SQL, talks

New features are available in the bitemporal repo – and I am so happy about it!

I really hope that most of my followers know something about the pg_bitemporal project, because if you haven’t heard about it, you won’t be able to share my excitement!

We started to build our bitemporal library for PostgreSQL about four years ago. It was merely a “proof of concept”, and Chad Slaughter, who initiated all this work and knows my work habits way too well, kept reiterating again and again: do not optimize it yet!

Well, I didn’t, but then I joined Braviant Holdings, and a year later I was granted permission to use our bitemporal framework in production. Some of the performance flaws became apparent even during the test run, and I was able to fix them. Later, as we used it in production more and more, I came up with new functions, UPDATE_SELECT and CORRECT_SELECT, because we actually needed them, and because the bitemporal operations were supposed to behave the same way as regular database operations.

About three weeks ago we had a very important release, which, along with addressing multiple business needs, included some significant changes on the technical side. One of the consequences was that it significantly increased the traffic on our new platform, and as a result we started to see some timeouts.

Although these timeouts were pretty rare, we saw them as a problem. I had personally pledged that the system would remain scalable, and now I couldn’t just go with “bitemporal updates are slow”. Yes, the execution time was at 2 to 3 seconds most of the time, but sometimes it would spike, and our microservices have a hard timeout at 10 seconds.

Some time ago I already mentioned in this blog how thankful I am for those timeouts! Nothing fosters innovation more than the necessity to address performance problems immediately, because they have a direct impact on production.

This time around I was 99.9% sure that the periodic slowness happened during the remote query, which is a part of the problematic function. It turned out, though, that the remaining 0.1% was the case, and together with our DB team we were able to determine that the problematic statement was the last UPDATE in the bitemporal update function. If you had asked me a week before, I would have said that I was not going to address bitemporal performance for the next several months, but I had no choice.

Thanks to Boris Novikov, who helped me immensely in testing and verifying several different approaches and eventually identified the best one; to Chad Slaughter, who was merging my commits from 7:30 AM to 9:30 PM so that the master branch of the bitemporal library would have the latest updates by the time of the release; and to our amazing QA team, who had to run and rerun tests multiple times that day, the new bitemporal functions are now in place. Not only for Braviant Holdings, but for the whole community.

I would also like to mention that, since I was already changing the functions, I fixed one long-overdue issue: all functions now have versions which are PG 10 compliant. We’ve left the old versions there, because some of them are used in existing production systems, but if you are just starting, you can use the new ones.

Check it out at https://github.com/scalegenius/pg_bitemporal


Filed under news, research, SQL, Team and teamwork

The second rejected paper

****Reposting because the previous version didn’t get shared on LinkedIn****

Object-relational impedance mismatch is by far my favorite research topic, mostly because it has a very practical application. I would make an even stronger statement: the most rewarding optimization is the one where you can reduce the number of SQL statements executed when a web page is rendered, and all of a sudden it loads 50 times faster (and I mean actually 50 times, not figuratively speaking!). It always looks like magic, and I haven’t done anything!
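To make the round-trip argument concrete, here is a toy cost model (my own illustration, with invented latency numbers): if rendering a page issues one query per item, the classic N+1 pattern, network latency is paid N+1 times; a single query returning the nested result pays it once.

```python
# Toy model: total time = round_trips * latency + rows * per_row_cost.
# The numbers below are made up for illustration only.
LATENCY_MS = 20.0   # one client-server round trip
PER_ROW_MS = 0.05   # marginal cost of shipping one row

def n_plus_one_ms(n_items):
    # One query for the item list, then one query per item
    return (1 + n_items) * LATENCY_MS + n_items * PER_ROW_MS

def single_query_ms(n_items):
    # One query returning the whole nested result
    return LATENCY_MS + n_items * PER_ROW_MS

speedup = n_plus_one_ms(100) / single_query_ms(100)
print(round(speedup, 1))  # → 81.0
```

With these assumed numbers, collapsing 101 round trips into one gives a speedup of the same order as the “50 times faster” I observed; the point is that latency, not row volume, dominates.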

That being said, ORMs are my worst enemies, and I am always looking for opportunities to promote better ways of communication between a database and an application. Most of the time the human factor turns out to be more important than the technological challenges, so I always think of these projects as battles.

At Braviant, however, for the first time in my professional career I had nobody to fight with about this issue: the app developers were completely on board with my approach from day one. This allowed us to develop something really cool, and to optimize the interaction between the database and the application to the point of absolute perfection. So, when my husband suggested we write a short paper about this project, I had no doubt it would be accepted, because two of my previous papers on the same subject had been accepted to very serious conferences.

Life proved me wrong :). I am not going to name the conference and the workshop, but I have to make some comments about the reviews, so that the level of my frustration can be understood.

One of the reviewers asked why we think that the number of round trips defines the response time of a web application. Another reviewer asked whether we had tried MongoDB :))). And why we think that (de)serialization of the JSON takes negligible time. And why we think Hibernate is worse.

I think the only valid objection was that the topic of the paper was not relevant to the workshop topic. And the latter might explain the whole story.

Several years ago, when I started to attend database conferences again after fifteen years of absence, I observed that a significant number of the attendees had never seen real applications and never had to deal with performance problems. Fortunately, I’ve also met and got to know some really outstanding researchers, whom I admire and feel honored to be acquainted with, so… I am sure I will find the right place to showcase our work.

And maybe it’s time to get back to my old “HDAT” workshop idea…

And for my fellow Chicagoans: I will be presenting this work this Tuesday, Feb 13 at the Chicago PUG meetup!


Filed under research


Our bitemporal paper was rejected, and how I feel about it

Actually, this winter I had not one but two papers rejected. And although I never dispute rejections (a rejection just means I failed to present my work adequately), I wanted to reflect on why both papers were rejected, and what I can do to get them accepted to other conferences.

With our bitemporal paper I was really upset that it didn’t make it to ICDE 2018, because I know that the work itself was orders of magnitude better than the work which was accepted for ICDE 2016. That leaves me with two options: either the topic was not relevant for the Industrial track, or we didn’t present our work well enough for its novelty to be visible.

I think it’s more that we didn’t explain ourselves well enough. I was trying not to dedicate a third of the paper to explaining the theory that lies beneath our implementation, and now I think that was a mistake. I didn’t elaborate on the fact that our second dimension is asserted time, not system time, and what the semantic difference is. So when our reviewers say “everybody has bitemporal time”, yes, that’s correct, but our two-dimensional time is different!

I know that the “asserted time” concept is not that easy to grasp when you read about it for the first time, and we didn’t provide any formal definitions. Nor did we provide formal definitions for the bitemporal operations. It does not matter that we followed the asserted versioning framework bible… We should have given the formal definitions, and we should have highlighted that it’s not a “bitemporal implementation for Postgres”, but rather that “we use Postgres to implement the asserted versioning framework, because Postgres has some cool features which make it easier”.

Oh, well. There is always a next conference :). Also, I think we should split this paper into smaller pieces; this one was an attempt to summarize three years of development.

Something to work on! And also, to continue developing the bitemporal library itself.


Filed under research

From theory to practice

For the past several months I have been implementing the bitemporal framework on real-life objects, not on lab mice :). And this process was quite a revelation!

I wrote the functions for the basic bitemporal operations almost two years ago, and talked about them at several conferences and workshops. I could not imagine anything could go wrong with them, and yet it did. And that’s exactly what happens when all your test cases are cloned lab mice!

One of the first errors I got was an empty assertion interval, and that’s when I realized that we had never discussed the relationship between transactions and bitemporal operations. Well, a transaction is a transaction, isn’t it? Nobody is supposed to see what’s inside until the transaction is finished, committed or rolled back. So… if there are several modifications (say, INSERT, UPDATE, and CORRECT for the same logical record) within one transaction, what are we supposed to see when the transaction is committed? Just an INSERT, if the first operation was an INSERT? But this “won’t be true”!

Yes, but on the other hand, imagine what would happen if we recorded the “interim” states: later we might run a query “as asserted” at some time in the past, at which exact moment some transactions were still uncommitted. Then we would get inconsistent results. As of now I haven’t decided how I want these situations to be handled. I am almost convinced that I want to give the user an option: if you want to be “anti-transactional”, you can :)). But then you’ll need to accept the consequences.

Another set of problems is rather philosophical: do we believe in reincarnation? :) More precisely, if an object is “bitemporally deleted”, and then a new object with the same business key value is created, is this “the same object” or a “new object”? Both ways can be supported, but I think that by default we should assume a “formal approach” and say that this is “the same” object. And if the real world (i.e., the business rules) is such that the new object is a different object… well, that means that something else should be included in the business key. For example, if the SSN is reused, then we need an extra piece of information, like the person’s date of birth.
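The “reincarnation” choice can be sketched in a few lines (a toy model of mine, not part of the pg_bitemporal library): with the business key as SSN alone, a re-created record denotes “the same” object; extending the key with date of birth makes it a different one.

```python
# Illustration only: the choice of business key decides object identity.
def same_object(rec_a, rec_b, key_fields):
    # Two records denote the same logical object iff their business keys match
    return all(rec_a[f] == rec_b[f] for f in key_fields)

deleted = {"ssn": "123-45-6789", "dob": "1950-01-01", "name": "Anna"}
created = {"ssn": "123-45-6789", "dob": "1990-07-15", "name": "Bob"}

print(same_object(deleted, created, ["ssn"]))         # SSN reused: "the same" object
print(same_object(deleted, created, ["ssn", "dob"]))  # extended key: a new object
```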

Related questions: can we update a deleted (inactive) record? What are the differences between UPDATE and CORRECTION if the date ranges are “equal”? I can only imagine how many issues like this are just waiting to be discovered!


Filed under Data management, Development and testing, research, SQL

ICDE 2017 – Day 3

This will again be more of a note to myself, to write later in more detail about what I learned at ICDE 2017.

I didn’t stay for the whole of Day 3, but I made sure to pay for the TSA pre-check and to use the fact that the conference venue was so close to the airport. The main events of Day 3 were:

  • The keynote by Pavel Pevzner about the “new revolution” in online education. I can’t say I liked it, because I disagree with a lot of what was said, but it was something that makes you think
  • The Industry 2 session, which was, to be honest, less interesting than Industry 1, although quite educational. The last presentation made me think again that the way we use the FDW for populating our Data Mart is something unconventional, and probably should be publicized more.

During the conference people were asking me what my company is doing, and I realized that our data modeling and predictive analytics (which I do not know much about) were of the most interest. Also, I always say that “we do not have any big data”, but now, having seen what other people consider “big data”, I am starting to think that maybe we do :).

Overall I am very excited about what I’ve learned and about the people I’ve met, and I want to reinvent my life again and do all those great things… and submit a paper to ICDE 2018, of course :).


Filed under events, People, research