PostgreSQL
The world's most advanced open source database

..::Planet PostgreSQL::..

November 30, 2009

Dimitri Fontaine

prefix 1.1.0

So I had two bug reports about prefix in less than a week. It means several things, one of them is that my code is getting used in the wild, which is nice. The other side of the coin is that people do find bugs in there. This one is about the behavior of the btree opclass of the type prefix range. We cheat a lot there by simply having written one, because a range does not have a strict ordering: is [1-3] before of after [2-4]? But when you know you have no overlapping intervals in your prefix_range column, being able to have it part of a primary key is damn useful.

Note: in 8.5 we should have a way to express contraint exclusion and have PostgreSQL forbids overlapping entries for us. Not being there yet, you could write a constraint trigger and use the GiST index to have nice speed there, which is exactly what this constraint exclusion support is about.

It turns out the code change required is pretty simple:

-    return (a->first == b->first) ? (a->last - b->last) : (a->first - b->first);
+    /*
+     * we are comparing e.g. '1' and '12' (the shorter contains the
+     * smaller), so let's pretend '12' < '1' as it contains less elements.
+     */
+    return (alen == mlen) ? 1 : -1;

This happens in the compare support function (see Interfacing Extensions To Indexes) so that means you now have to rebuild your prefix_range btree indexes, hence the version number bump.

by Dimitri Fontaine at November 30, 2009 10:10 AM

Jeff Davis

Linux OOM Killer

The Linux OOM Killer heuristic can be summed up as:

  1. Run out of memory.
  2. Kill PostgreSQL.
  3. Look for processes that might be using too much memory, and kill them, hopefully freeing memory.

Notice that step #2 is independent of factors like:

  • Is PostgreSQL consuming a significant share of the memory resources?
  • Will killing PostgreSQL alleviate any memory pressure at all?

The reason for this is because linux, when under memory pressure, invokes the badness() function to determine which process to kill. One of the things that the function counts against a process is the total VM size of the process (no surprise), plus half of the total VM size of all of the children. That sounds reasonable — wait a minute, what about shared memory?

For every connection, PostgreSQL makes a child process, which shares memory with the parent; and if it is a big postgres instance, that can be a significant amount of memory. But the badness function counts the same shared memory against the parent 1+N/2 times, where N is the number of connections open! For example, if you have shared_buffers set to 1GB on an 8GB machine, and have 20 connections open, then the linux badness function thinks that PostgreSQL is using about 11GB of memory, when it is actually using 1GB!

It gets worse: killing a process does not free shared memory. And there are already administrator-controlled limits that start off fairly low (32MB, I think), so the administrator would have to make a mistake in order for shared memory to be the problem.

And it gets even worse: linux makes very little attempt to avoid getting into a bad situation in the first place. On most operating systems, if you ask for too much memory, malloc() will eventually fail, which will probably cause the application to terminate; or if the application is written to handle such conditions (like PostgreSQL), it will gracefully degrade. On linux, by default, brk() (the system call underlying malloc()) will happily return success in almost every situation, and the first hint that your application gets of a problem is a SIGKILL.

I have heard many excuses for this appalling set of behaviors, but none of them are satisfactory to me. Here is one explanation. Notice the implicit assumption is that the most important thing that a user may be running is a text editor, and it should just autosave every once in a while to avoid problems. Keeping processes actually running is apparently not important to Linux. The next general philosophy present in the email is that applications are stupid, and they will never get the error paths or rollback right, but PostgreSQL seems to do that just fine (it needs very little memory to roll back, and tries to free memory before that to make it less likely to run into a problem). And the author seems to think that the first hint of a system problem should be a SIGKILL based on some heuristic (which, in the case of linux, is seriously flawed, as I pointed out above).

Among other justifications are:

  • “You can configure linux to prevent this problem.” Sounds like MySQL, where you have to declare your tables to be transaction-safe. Why not safe by default, and configure for performance?
  • “Modern systems overcommit, and there are always edge cases where you get into problems.” OK, so keep them as edge cases, and at least attempt to avoid problem situations. If some bizarre set of circumstances, or maybe a malicious process, is able to cause memory problems on your system, then so be it. But well-behaved processes should get some indication of trouble prior to SIGKILL. And the real problem processes should be identified with some accuracy and reasonable intelligence (i.e. recognize that shared memory can’t be freed by the OOM Killer, and therefore shouldn’t be considered).

I have been complaining about this insanity for years:

It makes me happy that other free software operating systems are still under active development, as illustrated by the recent release of FreeBSD 8.0. I am not saying FreeBSD is better in every way than Linux, but I do believe that competition is important even in the free software world.

This post is obviously a rant, and an ill-informed one, at that. I am no expert on memory management, so let me know if I am way off-base. However, I have never heard reasonable justifications for these things in response to my emails to lkml, in personal discussions with kernel hackers or other knowledgeable Linux folk, or anywhere else. If you do have a reasonable explanation, I am more than willing to listen.

by Jeff Davis at November 30, 2009 04:07 AM

November 28, 2009

David Wheeler

JPUG Talk Posted

No Perl content today, I'm afraid. I'm just back from my trip to Japan and wanted to post this very nice video of my talk. Unlike the versions from other conferences, this one focuses primarily on the slides, with me appearing in audio only. This makes it really easy to follow. Enjoy.

Read More »

by David E. Wheeler at November 28, 2009 07:44 PM

Dave Page

JPUG 10th Anniversary conference pics

I finally got my photos from the JPUG 10th Anniversary conference in Tokyo online. Most are actually from the events before and after the actual conference which involved some sightseeing, and lots of discussions of PostgreSQL on topics such as infrastructure issues and server virtualisation, how to be more supportive of new developers, and how to make MERGE work nicely with the PostgreSQL rules system.

Breakfast sushi at 6AM!After hitting the limits on my free Flickr account I figured it was time to move to Smugmug, so you can find the full set of pics over there.

Thanks again to all the JPUG folks for an exceptional conference!

by Dave Page (noreply@blogger.com) at November 28, 2009 11:07 AM

Leo Hsu and Regina Obe

An almost idiot's guide to Install and Upgrade to PostgreSQL 8.4 with Yum

In this exercise, we'll go thru our steps for upgrading a Redhat Enterprise Linux 64-bit PostgreSQL box from PostgreSQL 8.3 to PostgreSQL 8.4. If you don't have any kind of PostgreSQL installed on your box, you can skip the Upgrade step.

Updgrading from PostgreSQL 8.* to PostgreSQL 8.4

If you are starting from scratch -- just skip this section.

If you are upgrading from PostgreSQL 8.3 to 8.4, in theory you can use PgMigrator, but in practice, particularly with a Yum install, you are bound to run into obstacles. If you are running an older version, you must dump and restore.


Continue reading "An almost idiot's guide to Install and Upgrade to PostgreSQL 8.4 with Yum"

by nospam@example.com (Leo Hsu and Regina Obe) at November 28, 2009 09:48 AM

November 27, 2009

Magnus Hagander

Feedback from pgday.eu

I've finally had the time to summarize the feedback we received from pgday.eu.

We received feedback from about 35 people, which is obviously way less than we were hoping for. Ideas for how to improve this for next time are very welcome! This also means that the figures we have are not very exact - but they should give a general hint about what our attendees thought.

I just sent out the individual session feedback summaries to each individual speaker. These will not be published - it's of course fine for each speaker to publish his own feedback if he wants to, but the conference organizers will not publish the detailed per-session data.

The statistics we do have show that most of our speakers did a very good job, and that the attendees were in general very happy with the sessions. We have also received a fairly large amount of comments - both to the conference and the speakers - which will help us improve specific points for next year!

I'll show a couple of graphs here with the total across all sessions and speakers. In these graphs, 5 is the highest score and 1 is the lowest.

724967910_MtQ4Q-O.png

The attendees also seemed to be very happy with our speakers, which is something I'm very happy to hear about. It's also good to see that almost nobody felt the speakers didn't know very well what they were talking about - always a worry with a conference that has so many experienced community people attending.

724967906_c8gwb-O.png 724967905_Ga5rN-O.png

Actually trying to figure out which speaker is best using this data is very difficult. But here's a list of the top speakers based on speaker quality, who had more than 5 ratings on their talks. The list includes all speakers with an average score of at least 3.5. There are a lot more hovering around that line, but there has to be a cutoff somewhere... Again note that there are still not that many ratings to consider, so values are pretty unstable. I've included the standard deviation as well to make sure this is visible.

Place Speaker Score Stddev Num
1 Gavin M. Roy 4.9 0.5 18
2 Guillaume Lelarge 4.9 0.4 7
3 Robert Hodges 4.8 0.4 13
4 Magnus Hagander 4.8 0.4 20
5 Jean-Paul Argudo 4.8 0.5 8
6 Joshua D. Drake 4.6 0.7 9
7 Simon Riggs 4.6 0.6 17
8 Dimitri Fontaine 4.5 0.5 14
9 Greg Stark 4.3 0.5 8
10 Vincent Moreau 4.1 0.6 8
11 Mark Cave-Ayland 4.0 0.6 11
12 David Fetter 3.9 1.1 9
13 Gabriele Bartolini 3.7 1.0 15
14 Heikki Linnakangas 3.6 0.7 9

All of these are clearly very good numbers.

So once again, a big thanks to our speakers for their good work. And also a very big thanks to those who did fill out the session feedback forms - your input is very valuable!

Update: Yes, these graphs were made with a python script calling the Google Charts API. Does anybody know of a native python library that will generate goodlooking charts without having to call a remote web service?

by nospam@example.com (Magnus Hagander) at November 27, 2009 11:46 AM

November 25, 2009

Pavel Stehule

Aggregate function MEDIAN in PostgreSQL

Searching and calculating Median in databases was terrible. Still median isn't ANSI SQL aggregate function. There are two commons method how to calculate median of some column. First - very old, and very slow based on self join alchemy, second - new - based on analytic function. Now, I will be test some newer methods on one million rows large table:
postgres=# create table milrows(a real);
CREATE TABLE
Time: 7,975 ms
postgres=# insert into milrows select random()*1000000 from generate_series(1,1000000);
INSERT 0 1000000
Time: 6863,575 ms
simple scan on this table takes 200ms
Frompostgres=# select avg(a) from milrows ;
avg
──────────────────
499515.883033113
(1 row)

Time: 200,176 ms
In 8.4 we can use analytic functions. These functions uses TupleStore - internal store feature - it allows work with very large tables - limit is free space on disc.

Analytic methods

--Joe Celko's method
postgres=# SELECT avg(a)::float
FROM (SELECT a, row_number() OVER (ORDER BY a asc) AS hi,
count(*) OVER () + 1 - row_number() OVER (ORDER BY a) AS lo
FROM milrows) qs
WHERE hi IN (lo-1,lo,lo+1);
avg
───────────────
499188.546875
(1 row)

Time: 4922,678 ms

-- Andrew Gierth's method
postgres=# select avg(a)
from ( select a, row_number() over (order by a),count(*) over () from milrows ) s
where row_number between floor((count::float8-1)/2+1) and ceil((count::float8-1)/2+1)
;
avg
───────────────
499188.546875
(1 row)

Time: 5021,001 ms

-- modified Andrew's method (count(*) over () is slow)
postgres=# select avg(a)
from ( select a, row_number() over (order by a),(select count(*) from milrows) as count from milrows ) s
where row_number between floor((count::float8-1)/2+1) and ceil((count::float8-1)/2+1)
;
avg
───────────────
499188.546875
(1 row)

Time: 3931,922 ms

Array based methods

Next methods are based on using an arrays. These methods are fast, but limit for this methods is size of operation memory. For very very large tables could to take all application memory.
--Regina's method -- it's not 100% correct http://www.postgresonline.com/journal/index.php?/archives/67-Build-Median-Aggregate-Function-in-SQL.html#extended

CREATE OR REPLACE FUNCTION array_median(double precision[])
RETURNS double precision AS
$$
SELECT CASE WHEN array_upper($1,1) = 0 THEN null
ELSE asorted[ceiling(array_upper(asorted,1)/2.0)]::double precision END
FROM (SELECT ARRAY(SELECT $1[n]
FROM generate_series(1, array_upper($1, 1)) AS n
WHERE $1[n] IS NOT NULL
ORDER BY $1[n]) As asorted) As foo
$$ LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE median(double precision) (
SFUNC=array_append,
STYPE=double precision[],
FINALFUNC=array_median
);

postgres=# select median(a) from milrows ;
^CCancel request sent
ERROR: canceling statement due to user request -- killed 5 minutes !don't use array_append for bigger arrays (length > 10000)
postgres=#

--My method
postgres=# create or replace function median(anyarray)
returns double precision as $$
select ($1[array_upper($1,1)/2+1]::double precision + $1[(array_upper($1,1)+1) / 2]::double precision) / 2.0;
$$ language sql immutable strict;
CREATE FUNCTION
Time: 1,557 ms

Time: 2574,677 ms
postgres=# select median(array(select a from milrows where a is not null order by a));
median
───────────────
499188.546875
(1 row)

Time: 2555,342 ms
This week I added support for median aggregate to orafce package. You can download it from url http://pgfoundry.org/frs/download.php/2472/orafce-3.0.2-devel.tar.gz . Function median use some fetures 8.4 and needs 8.4 - it isn't supported on PostgreSQL 8.3 and older.
-- orafce 3.0.2 median (needs PostgreSQL 8.4 and higher)

postgres=# select median(a::float8) from milrows;
median
───────────────
499188.546875
(1 row)

Time: 687,577 ms
It's very fast - if your table has about one million rows (1000000) you can use it (for this table size takes max. 15MB RAM (for one column)).

by Pavel Stěhule (noreply@blogger.com) at November 25, 2009 11:58 AM

Dimitri Fontaine

Yet Another PostgreSQL tool hits debian

So there it is, this newer contribution of mine that I presented at PGDay is now in debian NEW queue. pg_staging will empower you with respect to what you do about those nightly backups (pg_dump -Fc or something).

The tool provides a lot of commands to either dump or restore a database. It comes with documentation covering about it all, except for the londiste support part, which will be there in time for 1.0.0 release. The Todo list is getting smaller and smaller, the version you'll soon find in debian sid is already called 0.9.

So, how do you go about using this software, and what service it implements?

it's all about deriving a staging environment from your backups

To validate backups, you want to restore them and check the database you get from them. And your developers will want to sometime refresh the database they're working with. And you could have both an integration environment and a pre-live one: On the former, you develop new code atop a stable set of data; while on the latter you test stable enough code (ready to go live) on a set of data as near as live data as possible.

And you want to be flexible about it, so that there's not a fulltime job to handle retoring databases each and every days, for project A integration or project B pre-live testing, or project C accounting snapshot. Or you name it.

And of course you want to have a single point of control of all your databases. Let's call it the controler.

setting up pg_staging

The pg_staging setup consists of one pg_staging.ini file wherein you describe your different target databases (those dev and prelive ones), and of course where to get the production backups from. Currently you have to serve the backups file in a format suitable for pg_restore (that means you use either pg_dump -Ft or pg_dump -Fc) on an apache folder. The produced HTML will get parsed.

So you setup the DEFAULT section with common settings, then one section per target: the databases you want to restore. Tell pg_staging where they are (host), etc, and it'll be able to drive them.

In order to being able to host more than a single restored dump on a staging server, for the same database, we use pgbouncer:

pg_staging> pgbouncer some_db.dev
              some_db      some_db_20091029 :5432
     some_db_20090717      some_db_20090717 :5432
     some_db_20091029      some_db_20091029 :5432

So as explained into the pg_staging(1) man page, you have to open non-interactive SSH connection from the controler to the hosts where the databases will get restored. Then you have to do a minimal setup pgbouncer on the hosts with a trust connection. It'll get used from pg_staging for adding newly restored database and have them accessible. Then you can also switch the new database to being the virtual some_db so that you avoid editing any connection string on your softwares.

Also, install the pgstaging-client package on every host you target. The client is a simple shell script that must run as root (sudo is used) in order to replace your pgbouncer setup or manage your londiste services.

See man 5 pg_staging for available options, including schemas to filter out either completely or just skipping data restoring in those.

pg_staging usage

Now you're all setup, you can begin to enjoy using pgstaging. Enter the console and see what you have in there.

$ pg_staging
Welcome to pg_staging 0.9.
pg_staging> databases
...
pg_staging> restore some_db.dev
...
pg_staging> pgbouncer some_db.dev
...
pg_staging> dbsizes --all some_db.dev
...
pg_staging> psql some_db.dev
some_db_20091125=#

And as you can see in man pg_staging there are a lot of commands already. You can for example obtain a new pg_restore catalog from a dump file, with some schemas commented out. It will even comment out triggers that are using a function which is defined in a filtered out schema, for example a PGQ trigger. And much much more.

pg_staging will even allow you to dump your production databases, but consider installing a separate instance of it on the machine serving the backups to your local network thanks to an apache directory listing!

Roadmap to 1.0.0

What's remain to be done is testing and having PITR based restoring to work, and adding some documentation (tutorial, which this blog post about is; and londiste support). At this point, unless some reader here asks for a new feature (set), I'll consider pg_staging ready for 1.0.0. After all, we're using it about daily here :)

Consider commenting, you should be able to easily spot my private mail address...

by Dimitri Fontaine at November 25, 2009 09:49 AM

November 24, 2009

David Fetter

Pre-OSDC

Thanks to the Brisbane PHP/MySQL User Group for a lovely evening.

First, we...
Continue reading "Pre-OSDC"

by nospam@example.com (David Fetter) at November 24, 2009 02:07 PM

PostgreSQL Conference Japan...and onward

Finally, the people doing all kinds of clustering in PostgreSQL have met and sketched out the issues.
Expect great things to come from this...

And that was before the official conference even got started.

Thanks to all the people who made the JPUG 10th Anniversary great!
Continue reading "PostgreSQL Conference Japan...and onward"

by nospam@example.com (David Fetter) at November 24, 2009 02:03 PM