Tag Archives: databases

What I am looking (and not looking) for

Since I’ve been looking for database developers and DBAs for quite some time now, and since virtually everybody knows about this, people often ask me: what are you looking for? What skills and qualifications are you interested in? Who would be your ideal candidate?

Most of the time I reply: please read the job description. I know that all the career advisors tell you to “apply even if you do not have some of the qualifications”, but as for my job postings, I actually need the qualifications listed as “required”, and I would really prefer candidates who have the “is a plus” qualifications.

Also, there are definitely some big DON’Ts which I wish never to hear again during an interview:

  • when asked for the definition of a foreign key, starting your answer with “when we need to join two tables”
  • when asked about normalization, starting with “for better performance”
  • when asked about schemas, saying that we use them for storage optimization

Today, however, I was asked a different question: why do you say that you are looking for skilled candidates, while at the same time admitting that for anybody who gets hired there will be a long learning process? If a candidate does not know something, doesn’t it mean (s)he does not have enough skills? Doesn’t it mean (s)he is underqualified?

I thought for a while before I responded. When I was first hired as a Postgres DBA, it was a senior position right away, although at that time I did not know any Postgres at all. But the people who hired me were confident that not only could I learn fast, but also that I could generalize my existing knowledge and skills and apply them in the new environment.

To build on this example, there are two prerequisites for success: knowledge and the ability to apply it in real-life circumstances.

I think that a person who wants to succeed as a database developer or a DBA should possess solid knowledge of relational theory. But it is not enough to memorize your Ullman or your Jennifer Widom; you need to be able to connect this theory to real-world problems. This is such an obvious thing that I never thought I would need to write about it, but life proved me wrong :).

The same goes for the situation when a candidate has a lot of experience with another database, not the one you need. Yes, different database systems may be different, and significantly different. Can somebody who is a very skilled Oracle DBA be qualified for a Postgres DBA position? Yes, if this person knows how to separate generic knowledge from system specifics.

If you know how to read an execution plan in Oracle, and, more importantly, why you need to be able to read it, you will have no problem reading execution plans in Postgres. If you used the system tables in Oracle to generate dynamic SQL, you will know exactly what to look for in the Postgres catalog. And if you know that queries can be optimized, that will help no matter what the specific DBMS is. It won’t help if the only thing you know is how to run utilities.
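To make the parallel concrete, here is a minimal sketch of generating dynamic SQL from the Postgres catalog, much as you would from Oracle’s data dictionary views (the `reporting` role and the `public` schema are made-up names):

```sql
-- Generate GRANT statements for every table in a schema.
-- 'reporting' and 'public' are hypothetical names.
SELECT format('GRANT SELECT ON %I.%I TO reporting;',
              schemaname, tablename)
FROM pg_catalog.pg_tables
WHERE schemaname = 'public';
```

The `%I` placeholder quotes identifiers safely, which is exactly the kind of detail that transfers between systems once you know why catalogs exist.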

… No idea whether this blog post is optimistic or pessimistic, but… we are still hiring 🙂



Filed under SQL, Uncategorized

Don’t forget about transactions – even when you do not write anything

A couple of months ago we started to run a job which collects execution statistics in our OLTP database. We’ve been running a similar job in our reporting system for a while, but there was a significant difference: which SELECTs we would consider to be long-running.

In the reporting system you expect things to be a little bit slower, so I would not care about SELECT statements which run for less than a minute. And for the longer ones it was enough to collect stats once a minute, which meant that we could schedule the execution using our cron-like SQL-schedule-running system.

Not the case for the OLTP database. There we would consider 30-second SQL statements unacceptably slow, so we definitely wanted to monitor them. But what about the 1-minute scheduler granularity? I couldn’t run a shell script in our scheduling system; it was designed for SQL execution only.

Then I thought I had the smartest idea ever: I suggested we run a loop inside a function, and pass the number of seconds it would sleep between readings of the database stats as a function parameter… and I thought it was working; I thought so for a while. There were other small issues I needed to address, and I’d been fixing them. And then I realized that something was wrong with my monitoring: the execution times for long-running statements were suspiciously “even”, lasting 55 seconds, or 1 minute 55 seconds… I was staring at the code… and suddenly understood what was wrong. Then I quickly ran an experiment which confirmed my suspicions.
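The idea can be sketched roughly like this (all table, column, and function names are hypothetical):

```sql
-- Loop inside a function, sleeping p_sleep_sec seconds between
-- reads of pg_stat_activity; runs for about one scheduler tick.
CREATE OR REPLACE FUNCTION log_long_running(p_sleep_sec int)
RETURNS void AS
$$
BEGIN
    FOR i IN 1 .. 60 / p_sleep_sec LOOP
        INSERT INTO long_running_log (observed_at, pid, query_start, query)
        SELECT now(), pid, query_start, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > interval '30 seconds';
        PERFORM pg_sleep(p_sleep_sec);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Look closely at this sketch with the post title in mind before reading on.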

Did you realize what had happened? Continue reading!



Filed under Data management, SQL

April Chicago PUG is next week!

Just a very short announcement/reminder: our meetup is scheduled for next Wednesday, April 11! So far I have been fortunate to have great speakers at each and every meetup, and April will be no exception: our guest speaker will be Kirk Roybal. His talk is titled “PostgreSQL ETL using Kettle and FDW”, and I can’t remember whether we ever had an ETL talk at our meetup. Actually, maybe once, when we presented our own solution :).

For some reason almost everybody believes that you need to have some specialized system for your reporting solution. I can’t recall how many times, when people ask me what database we use in our company and I say Postgres, the next question comes: what do you have for your ETL? And it comes as something unexpected when I repeat: Postgres. Just Postgres.

So come to our meetup next Wednesday and find out what you can do with “just Postgres”!


Filed under events, SQL, talks

I am not sure what I fixed, but I’ve definitely fixed something

I’ve had this problem for a while. It’s very difficult to describe, and even more difficult to report, because I do not know a good way to reproduce it.

The reason I am writing about it is that if somebody ever had, or will have, a similar problem, then a) you know there is a way to fix it, and b) if there is more than one person experiencing the same problem, together we can find the root cause.

So… when we import data from our external service providers’ databases, we use an EC2 machine with a Postgres instance running on it as our “proxy”. We have several foreign data wrappers installed on the said EC2 instance, and all the external databases (which use different DBMSs) are mapped to the Postgres database, from where they are mapped to our Data Warehouse. The Data Warehouse resides on RDS, which means that only the Postgres FDW is available.

We didn’t have any issues while we were only using this setup to refresh materialized views in our Data Warehouse. But recently we started to use the same proxy to communicate with one of the external databases from the OLTP database. And that’s when strange things started to happen.

They happen when we have “a complex” query, and that’s what I can’t quantify. I can’t say “if we have more than five external tables joined” or “if we have more than one join condition on more than two tables”… it just happens at some point. What happens? The query starts to return only the first row of the result set.

When I run the same query on the proxy, it returns the correct number of rows, so the specific FDW does not appear to be the problem. Then what? I do not know the answer. The way I fixed it: I created a view on the proxy which joins all the tables I need, and mapped this view to the OLTP database. At first I was reluctant to do it, because I was sure the conditions would not be pushed down correctly to the lowest level, and thus the query would be incredibly slow, but life proved me wrong :). It works beautifully, and very fast.
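For illustration, the workaround looks roughly like this (all object names here are made up):

```sql
-- On the proxy: a view joining the foreign tables we need.
CREATE VIEW ext.orders_enriched AS
SELECT o.order_id, o.created_at, c.customer_name
FROM ext.orders o
JOIN ext.customers c ON c.customer_id = o.customer_id;

-- On the OLTP database: map the view through postgres_fdw,
-- exactly as if it were a table.
CREATE FOREIGN TABLE proxy.orders_enriched (
    order_id      bigint,
    created_at    timestamptz,
    customer_name text
)
SERVER proxy_server
OPTIONS (schema_name 'ext', table_name 'orders_enriched');
```

Since postgres_fdw does not care whether the remote relation is a table or a view, the join happens entirely on the proxy, and the OLTP side sees a single foreign table.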

So, for now the problem is solved, but I am still wondering what exactly causes it in the original query…


Filed under Data management, Development and testing, SQL

The best things from 2017 and what I am looking forward to in 2018

When I talk to people about the year 2017 and what was important, I repeat over and over: everything I was striving for during my 30+ years of professional career has happened. All wishes have come true. Which inevitably leads to the question: well, Hettie, what are you going to do next, then?

And there are plenty and plenty of things I want to do, but first I wanted to reflect on 2017 one more time. I’ve already listed, multiple times, all our technical accomplishments, all the wonderful things I was able to implement in a real production environment. But when I look back at what was the best, it is definitely working with our new tech team. As I reiterate over and over again, “the database is a service”, and whatever we do inside the database can only make an impact if our results can be utilized by “somebody”: by our end users.

And most of the time our end users are application developers. I can’t imagine any serious database development without continuous interaction with the rest of the development team. For years my “gold standard” for teamwork has been my work for the New York Department of Education, where I had the full support of the tech leadership, which helped me defend my approaches and produce the results I am still happy about.

But what has been going on in the past several months is even better. The way we discuss the user stories. The way we make decisions on what should go into the application, and what into the database. The way we debug. The way we discuss what is critical, and what’s not. The willingness of each of the team members to “go the extra mile”, to adjust in order to make others’ jobs easier. Basically, whatever I could imagine for perfect teamwork is there. 🙂

Being in this industry for 34 years, I know that nothing lasts forever :). I know that every team, every organization evolves, and that nothing is granted forever. But I also learned to be thankful for what’s going on right now, and to enjoy the moment.

Looking forward to 2018: I hope that our team will continue to be the most amazing team ever. I am looking forward to the stress-test of our new applications with high data volumes. After all, I designed the data storage and data access to be super-scalable, and I hope that it will work this way.

I am looking forward to using the performance data we’ll obtain to improve our bitemporal library, and I already have some ideas of what exactly I want to do. Actually, I have new technical ideas on almost everything I’ve developed over the last year. I still haven’t gotten notifications on whether any of the papers I submitted for next year’s conferences were accepted, but I know that both are very good papers :), so even if neither gets accepted… I will try another conference!


Filed under Development and testing, SQL, Systems, Team and teamwork

I finally figured out how it works!

I never understood why, in some cases, when you look at a Postgres execution plan, you see “Recheck Cond”. This happens when an index search is executed, and when I looked at execution plans of that kind, I could never understand why Postgres wants to go back to the table and “recheck” something, when all the conditions should be satisfied by the index itself.

Last week, however, I found myself in a non-typical situation: I had a query to optimize, and I could not bring it to an acceptable speed. It was imperative that it execute in less than a minute, and highly preferable within 30 seconds. I knew that the culprit was one of the sub-selects, which was querying a very large table, and thereby I absolutely had to convince Postgres to use an index-only scan.

And it didn’t work. I created a compound partial index which exactly described the condition I was filtering by, and I included in this index all the fields I needed to select. But the query continued to be slow, and the execution plan was clearly showing “Recheck Cond”.

After a number of experiments I realized that in order to ensure the index-only scan I need to include in the index not only the attributes I am selecting, but also the attributes I am constraining on.

Specifically, when I was running the query

SELECT loan_id,
       min(date) AS first_date
FROM very_large_table
WHERE days_past_due > 30 AND status IN (1,2)
GROUP BY loan_id;

having the index

CREATE INDEX cond_index ON
very_large_table(loan_id, date, days_past_due)
WHERE days_past_due > 30 AND status IN (1,2);

was not enough to ensure the index-only scan. Instead I had to build the following index:

CREATE INDEX cond_index_2 ON
very_large_table(loan_id, date, days_past_due, status)
WHERE days_past_due > 30 AND status IN (1,2);

I wondered for a while: why? But then I sort of thought about this “as a database” and realized that if the status is not included in the index, then, if for some record only the status changes and all the indexed attributes remain the same, the index entry just won’t be updated, and thereby the condition will require a check. On the contrary, if the status changes from “something else” to 1 or 2, the index condition will be triggered, and this record will be re-indexed.
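The difference is easy to confirm with EXPLAIN: with cond_index_2 in place, the plan should show an Index Only Scan instead of a scan with a recheck (the exact plan, of course, depends on your table statistics and visibility map):

```sql
-- Check which scan the planner picks for the aggregate query.
EXPLAIN
SELECT loan_id,
       min(date) AS first_date
FROM very_large_table
WHERE days_past_due > 30 AND status IN (1,2)
GROUP BY loan_id;
```

Running EXPLAIN before and after creating the wider index makes the effect of including the constrained columns immediately visible.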

Now I will remember!


Filed under SQL

One more reminder about the conference on November 9

Just wanted to make sure, one more time, that all my friends in Chicago remember the 2Q PG Conf, which will take place in Chicago on November 9.

I’ve already mentioned earlier that I will be presenting at this conference, but now the whole schedule is published, and you can see that all the speakers are just extraordinary!

Hope you can attend!


Filed under events, talks