parallel query option
We are running a BI application on a mid-size data warehouse. PostgreSQL scales reasonably well for the BI app - i.e. when there are enough users running dashboards, ad-hoc queries, etc. the server is as busy as expected.
HOWEVER, when running the ETL overnight most of the processors are idle - because the single ETL process will only utilize one processor (core).
And - trust me on this - I simply do not want to listen to smart advice like "why don't you write parallel ETL" or similar :)
Other databases have this functionality, so why not PostgreSQL?
What about parallelizing aggregation calculations? There is no reason why separate groupings can't be calculated in separate threads within the same process.
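To make the idea concrete, here is a minimal sketch (with a hypothetical `sales(region, product, amount)` table) of two groupings that are completely independent of each other - exactly the kind of work a parallel-aware executor could hand to separate cores instead of running serially:

```sql
-- Hypothetical table: sales(region, product, amount).
-- The two aggregations below share no state, so in principle
-- each GROUP BY could run on its own core before the results
-- are concatenated.
SELECT region::text AS grp, sum(amount) AS total
FROM   sales
GROUP  BY region
UNION ALL
SELECT product, sum(amount)
FROM   sales
GROUP  BY product;
```

Today both branches execute one after the other in a single backend process, even though nothing forces that ordering.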
If I get you right, Nick, you are thinking that parallel queries are akin to sharing the connection or transaction mechanism (so you can run several queries in parallel using one connection), but in practice it is quite the opposite. Think of parallel queries more along the lines of database partitioning.
When you tell the optimizer to use the parallel query feature, you give it permission to break a query or procedural script up across multiple connections that are not shared (by "script" I mean pl/pgsql or pl/sql, for example - and really we're talking about the queries or DML within those scripts). The optimizer chooses the best way to split the work and assigns each part its own processor. This lets the DB engine work on different parts of the query in parallel on the server, as opposed to one processor handling the whole query.

So if you tell it to use 4 processors, the query is processed on 4 processors where normally it would run on one. The downside is that on an 8-processor machine you now have only half the processors available for other work, so this is best used on machines with many processors. And the point is you only need to give the optimizer a hint; it figures out the best way to parallelize the query.
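For reference, the hint-driven style described above is what Oracle exposes (PostgreSQL has no such hint); a sketch with a hypothetical `big_fact` table:

```sql
-- Oracle-style parallel hint: ask the optimizer to use up to
-- 4 parallel execution servers when scanning big_fact.
-- Table and column names are illustrative only.
SELECT /*+ PARALLEL(big_fact, 4) */ region, sum(amount)
FROM   big_fact
GROUP  BY region;
```

The hint is advisory: the optimizer still decides how (and whether) to split the scan and aggregation across the requested number of workers.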
But the bottom line is that this is for processing the parts of a single query or script in parallel. Normally you only do this when something requires a lot of horsepower under a tight time constraint, and when you aren't going to impact others. So for example, overnight data warehouse ETL, where you might have gigabytes of data that need to be processed in a couple of hours or less.
So it really is quite the opposite of transaction sharing. With parallel queries the whole thing can theoretically be one transaction, and it won't share it - and it will hog multiple processors.
Does this make better sense as to what this is for?
Or maybe I'm not getting you. Is what you are talking about something that allows a single query to be run on multiple processors?
Correct me if I'm wrong, but in 9.2 there is a transaction sharing mechanism that allows you to share a transaction across 1+ connections... Something like:
- Open a transaction
- Reconnect to self
- Share the transaction
- Start async queries
- Using the async results, build the final query
- Close the connection(s) to self
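A rough sketch of those steps using the dblink extension together with 9.2's exported snapshots (the connection string, snapshot id, and `sales` table are placeholders):

```sql
-- Assumes: CREATE EXTENSION dblink;
-- and a hypothetical table sales(slice int, amount numeric).

-- Open a transaction and export its snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns an id, e.g. '000003A1-1'

-- Reconnect to self.
SELECT dblink_connect('w1', 'dbname=mydb');

-- Share the transaction: the worker adopts the same snapshot
-- (substitute the actual id returned above).
SELECT dblink_exec('w1',
  'BEGIN ISOLATION LEVEL REPEATABLE READ; ' ||
  'SET TRANSACTION SNAPSHOT ''000003A1-1''');

-- Start an async query on the worker connection.
SELECT dblink_send_query('w1',
  'SELECT sum(amount) FROM sales WHERE slice = 1');

-- ... meanwhile process other slices here, or fire more workers ...

-- Using the async result, build the final answer.
SELECT * FROM dblink_get_result('w1') AS t(total numeric);

-- Close the connection to self.
SELECT dblink_disconnect('w1');
COMMIT;
```

Note that `SET TRANSACTION SNAPSHOT` must run before any query in the worker's transaction, which is why the `BEGIN` and the `SET` are issued together via `dblink_exec` before the async query is sent.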
It seems that this is the 1st step toward a parallel query option. So, it's "started".
If you are talking about dblink, I'm not sure you are talking about the parallel queries mentioned here. This refers more to allowing the DBMS to use multiple processes on large/complex queries - and thus multiple processors (CPUs) - giving certain queries more horsepower (you have to tell it to use parallel processing via query hints). On big ETL processes this can speed things up dramatically (at the cost of possibly starving other processes of processors... but this is usually done in a controlled way, like on large overnight jobs).
You can share a transaction in 9.2. This is the 1st step toward PQ. You can do it manually right now with dblink (connect to localhost, share the transaction, and make an async query).
+1 on this.
This would be a great feature! I also think it could improve the way Postgres works with partitioned tables. At least seq scans on partitions could be done in parallel.
Crucial feature for creating large backups (databases larger than 200GB).
Виктор Михайлович commented
One of the most important things in my opinion for Postgres in DWH.
Agree, each query process should be threaded. This needs to happen sooner rather than later for PG.
I wish I could give this all 10 of my votes.