Category Archives: SQL

Databases are not sexy?

I’ve heard this line from a former colleague of mine. He was explaining why there are so little database people around, why IT students are not specializing in databases. That was his answer – that it not cool. “WEB-designer” sounds cool, “database developer” does not.

Several months passed since I heard that, and I was thinking: I should write a blog post about it! However, honestly – what else I can say except the things I’ve already said in this blog multiple times? That there is nothing more exciting, than exploring the endless possibilities of SQL, that nothing can be more rewarding than applying your magic to the jobs, which runs for an hour, and all of a sudden it runs in less than a minute and produces the same result:)

I suspect that the general public does not think that there is something behind a web page, and when somebody experiences a website slowness, they refer to it as “the internet is slow.” Also, the buzz words like “Big Data” often send a message, that “the databases are outdated,” and that there is something bigger, and cooler, and more fashionable, than just”databases,” which does not help a bit.

As I always like to be practical, and not only state a problem but come up with a way to solve it, I am now thinking about how to bridge this gap. I want to find ways to reach out to college and high school students and to give them some exposure to the Wonderful World Of Data. A couple of years ago when I was attending one of the Big Conferences, I’ve heard a discussion regarding “what our students want to be taught.” That was a time of Big Data just becoming a Big Deal :). Honestly, my opinion is that the student’s interest should not drive a curriculum entirely 🙂 and that that’s the right place to interfere.

Is anybody interested in joining me in this effort?

Advertisements

Leave a comment

Filed under Data management, SQL

Working Under Pressure, Next Chicago PUG And The Wonders of FDW

Anybody likes working under pressure? Anybody likes production crisis? Nobody would admit it, however…

Try to recall something really exciting which has happened to you in your work environment. No, not your recent promotion :). Try to remember last time you did something great, and things worked fabulously. May be you were not running around, as I do in such cases, telling everybody what a cool thing I just came up with, but even without this running around, most likely your the last great thing happened when there was some serious production issue and you had very limited time to come up with an entirely new solution.

You know how I love sharing the story of New York Hight School Application Processing System! And when I tell this story, I always mention, that I had very limited time to fix everything, and worked 16-hours days for 2 weeks.

It was not that kind of crisis recently at my work, but it was still a production issue . And as 99% of all of our production issues, this one have happened during some interaction with our third party service provider, who’s utilizing MySQL. To be honest, knowing what I know now about MySQL as of now, I am very close to making a statement, that it in not a database! Because, anything which wants to be treated as a database, should know how to use indexes :).

To make the story short, we’ve figured it all out, and rewrote a query – in a most absurd way :), I have to admit. But after this one, and a whole bunch of other similar exercises, I think I have a very good understanding of how things work on the MySQL, what is getting lost when we are using FDW, and what and how can be mitigated.

And guess what – I will be taking about all this cool stuff next Wednesday, at the Chicago PUG. Come and join the conversation!

Leave a comment

Filed under Data management, SQL

April Chicago PUG meetup is tomorrow!

Just one last reminder about tomorrow’s Chicago PUG Meetup: tomorrow we are hosting CockroachDB Lab, and I encourage everybody, including our members who were hibernating during this cold winter, to attend!

Here is a short summary of the talk:

CockroachDB is a distributed SQL database at scale for cloud native. It was Inspired by Spanner, but created as an open-source Postgres-compatible database with strong ACID guarantees. It gives an alternative to a standalone Postgres server to answer the most important questions we have to handle after we take our databases into production: How can I keep the database from going down? How can I migrate it between clouds? And, considering it’s a cluster, how can you keep the database instances consistent with each other?

CockroachDB takes on these challenges by allowing you to write your Postgres apps as if you’re talking to a single instance: your SQL code works the same way and you use standard Postgres drivers. But instead of replicating the data, the data is automatically written to a raft-consensus cluster that guarantees serializable isolation to stay consistent automatically.

CockroachDB is easy to play with in a lab and this talk will show you how. Your database can then seamlessly migrate to the cloud, and even scale between datacenters giving your Postgres database flexibility as you move from development through production. Come to this session to see how this technology works and learn how to download the open source and work with it in your own environment.

If you are planning to attend, but didn’t register yet, please take a moment to RSVP on the meetup page here. (to ensure we have enough pizza!)

Leave a comment

Filed under events, SQL

DB developers and App developers

Last week when I was at the PG Conf in New York, I met with my former co-worker, with whom we worked together while being on the New York Department of education project. She was a Java developer, and I was an Oracle developer.  I was implementing a solution, which was very close to the idea I am implementing now – building Oracle packages for each application endpoint, so that an application could only call functions, not separate SQL statements. Our collaboration was not always easy and peaceful those days, and when I was describing to her the specifics of the work we were presenting, I told her: see, if I would return to you the same type of objects back then, you won’t have any trouble with me!

She replied: Oh yes,  if that would be the case, I wouldn’t have any trouble with you! And the way she’d said that reminded me that she indeed had some troubles with me back then… And I thought to myself, that perhaps it’s not only the team I work with now is better than any other team I’ve worked with before (though it is still the case :)), but also that I myself had changed. And now I am way more flexible and willing to walk this extra mile, than fifteen years ago…  and our success is a product of collaboration, of everybody being able and willing to work together to find the right solution.

Leave a comment

Filed under People, SQL, Systems

How to build this nested JSON

This post is in response to multiple requests from the PG Conf attendees to provide a “schematic” example of how to build a record set and to convert it to a JSON object (basically to illustrate our “Connecting Galaxies” talk.

Actually, if you click on the link above and download the presentation, you will see an extra slide at the end, which provides the link to the code below… but we know that people do not like “clicking the links” 🙂

So – here is it one more time!

--- Building complex JSON objects from multiple tables --
--- Deriving aggregates json_agg and n_array_agg
--- wrapping an array (of rows) into JSON converted to text for JDBC ----transfer 

— Test aggregation
create table aa (a_id text, a_name text);
insert into aa values (‘1’, ‘Anna’),(‘2′,’Bob’);

—- Aggregation on second level —–
create table bb (b_id text, ab_id text, b_num text, b_ph text);
insert into bb values (‘101’, ‘1’, ‘101-101’, ‘(800)-123’);
insert into bb values (‘1012’, ‘1’, ‘1012102’, ‘(800)-1234’),
( ‘201’, ‘2’, ‘201-201’, ‘(800)-1345′);

— add one more embedded array on the second level —–
create table cc (c_id text, ac_id text, c_st text, c_more text);
insert into cc values
( ’11’, ‘1’ , ‘ stst’, ‘more-more’),
( ’12’, ‘1’ , ‘ swsw’, ‘more-less’);
insert into cc values
( ’21’, ‘2’ , ‘ stst-2’, ‘more-mo2222re’),
( ’22’, ‘2’ , ‘ sws222w’, ‘more-less22’);

—- add third level , dd will be under bb —-

create table dd (d_id text, bd_id text, d_value text);
insert into dd values
( ‘1101’ , ‘101’, ‘dv11’),
( ‘1102’ , ‘101’, ‘dv12’),
( ‘2101’ , ‘201’, ‘dv22’);

—- CTE might be helpful to avoid multiple processing of big top-level tables —-

create or replace
function array_agg_next (agg_sta anyarray, val anyelement, b boolean)
returns anyarray as
$$
declare
out_array alias for $0;
begin
if b then
if agg_sta is null then
out_array := array [val ];
else
out_array := agg_sta || val ;
end if;
else
out_array := agg_sta ;
end if;
return out_array;
END;
$$ LANGUAGE plpgsql;

drop
function array_agg_final (agg_sta anyarray);
create or replace
function array_agg_final (agg_sta anyarray)
returns anyarray as
$$
declare final_array alias for $0;
begin
final_array := agg_sta;
return final_array ;
END;
$$ LANGUAGE plpgsql;

—–======================================================
create or replace
function array_agg_next (agg_sta anyarray, val anyelement, b boolean)
returns anyarray as
$$
begin
if b then
if agg_sta is null
then agg_sta := array [val ];
else agg_sta := agg_sta || val;
end if;
end if;
return agg_sta;
END;
$$ LANGUAGE plpgsql;

drop aggregate n_array_agg (anyelement, boolean);
CREATE AGGREGATE n_array_agg (anyelement, boolean) (
sfunc= array_agg_next,
— FINALFUNC = array_agg_final,
STYPE = anyarray
);

drop type if exists jaa cascade;
create type jdd as (ddi_key text, dd_value text);
create type jcc as (cc_key text, cc_st text, more text);
Create type jbb as (bb_key text, num text, ph text, “DD” jdd[]);
create type jaa as (p_key text, p_name text, “BB” jbb[], “CC” jcc[]);

—- collect rows of a specific type into an array, depends on the output type —–
create or replace
function collect_items (s text)
returns jaa[]
LANGUAGE plpgsql as
$body$
declare
result jaa[];
begin
select array_agg(single_item) into result
from
(select
row (
a_id, a_name,
n_array_agg( row (b_id, b_num, b_ph, d_agg )::jbb , b_id is not null),
n_array_agg( row (c_id, c_st, c_more )::jcc, c_id is not null)
) ::jaa as single_item
from
(
select a_id, a_name, b_id, b_num, b_ph, c_id, c_st, c_more,
n_array_agg (row(d_id, d_value)::jdd, d_id is not null) as d_agg
from
(
select a_id, a_name, b_id, b_num, b_ph, null as c_id, null as c_st, null as c_more, d_id, d_value
from aa join bb on a_id = ab_id
left join dd on b_id = bd_id
UNION ALL
select a_id, a_name, null as b_id, null as b_num, null as b_ph, c_id, c_st, c_more, null as d_id, null as d_value
from aa join cc on a_id = ac_id
) temp_table_3
group by
a_id, a_name, b_id, b_num, b_ph, c_id, c_st, c_more
) temp_table_2
group by a_id, a_name) items
where 0=1
;
return result;
end;
$body$;

—– just another output type —-
create or replace
function collect_subitems (s text)
returns jbb[]
LANGUAGE plpgsql as
$body$
declare
result jbb[];
begin
select array_agg(single_item) into result
from
(
select row(
b_id, b_num, b_ph, n_array_agg (row(d_id, d_value)::jdd, d_id is not null)
)::jbb as single_item
from bb left join dd on b_id = bd_id
group by b_id, b_num, b_ph
) items;
return result;
end;
$body$;

— wrap array into a set of text strings containing the json representation of array elements
— This function does not depend on any type
create or replace
function array_transport (all_items anyarray) returns setof text
RETURNS NULL ON NULL INPUT
LANGUAGE plpgsql as
$body$
declare
item record;
begin
foreach item in array all_items
loop
return next( to_json(item))::text;
end loop;
end;
$body$;

select * from array_transport (collect_items(‘a’));

select * from array_transport (collect_subitems(‘a’));

===============================

 

Leave a comment

Filed under research, SQL, talks

I knew I won’t like this feature! About the parallel execution

Many years ago, when Postgres 9.6 was still in making, my coworker said to me with excitement: Hettie, Postgres will now have the ability to run queries in parallel! There can be only four parallel processes, but still, isn’t in nice?!

And I remember exactly what I’ve replied: no, I do not like this feature a bit! You know why? Because executing queries in parallel would rarely solve performance problems. In fact, if four parallel workers would solve your performance problems, they were not really problems! It can do more harm than good, because it can mask some real performance problems for a while, and then they will turn just to be more severe.

What happened then – I’ve started a new chapter of my life at Braviant, and we had Postgres 9.5 then, and then for almost 3 years it was like I never had time to stop and upgrade :). But now I have a team, so we’ve finally planned the upgrade, and since we were already four versions behind, we planned upgrade to 9.6 and immediately to PG 10.

We’ve started from our Data Warehouse. First – in the staging environment, we we’ve tested, and observed the execution for some time. And then on the mirroring instance, and only then – on production.

And then it started! Seriously, out of all data refreshes this one is only one, which is important for the OLTP instance, because it sends data to one of our client interfaces. It started to behave inconsistently, Sometimes it would be just fine. Other times, instead of about 50 seconds it has been running for an hour, and probably won’t finish if we won’t kill it. Yes, it was obvious, that something did change in the execution plan after the upgrade. But what?! From the first glance the execution plan looked the same, all HASH JOINS, which you would expect, when you join tables with no restrictive conditions.

But it was still painfully slow. What was more puzzling – I could take out of the equation JOIN to any table, and performance would be unpredictable. After several dozen of attempts to make things run decided to take a closer look at the execution plan. And you know what I saw?! Yes, parallel execution/! Four joins were parallelized, which resulted in the execution time been really horrible. After the issue was found, the only thing left was to figure out, how to turn this feature off 🙂

Leave a comment

Filed under Data management, Development and testing, SQL

One more time on the state of optimization

I just have to show some of the execution time graphs, and how the have changed after the optimized versions of the respective functions were deployed:

I know that many people are wondering looking at the second image, why I am striving to optimize things which are already running super-fast?

It’s not because I am trying to demonstrate my superpowers, it’s because i know for the fact, that with the database size we currently have, that is the right execution time. What does it mean? it means, that if the execution time is more than that, it indicates the wrong execution plan.

All these optimizations have been performed on our OLTP database, which means that all of these queries are “small queries”, retrieving a relatively small number of records. Which implies, that the appropriate indexes should be used, and that the execution plans should show the NESTED LOOP join algorithm. When I see the execution time of 500 ms, it tells me that there is at least one full table scan inside. Which in turn, means, that the execution time will be increasing, when the data volumes will be growing. Which is not good, if we are building a scalable system.

Another important thing to consider is that all these small queries cannot be “parallelized” to speed up the execution. We are in the OLTP environment, not OLAP. I know that I can’t rely on switching to the larger AWS instance, because 1) this process gets out of control very fast 2) does not help. Seen the execution times like on both of these pictures, like “I can’t see it” just proves, that the functions are performing as expected.

Leave a comment

Filed under Data management, Development and testing, SQL