TABLESAMPLE

Published: 2020-04-17T18:00:00.000Z

Recently I've been working a bit with Postgres. I wanted to offload as much work as possible to the database. I had never seriously learned SQL before, so this was a first for me.

One of the things I wanted to do in SQL was taking a sample of 1'000 entities out of a big table with 70'000 rows.

SQL specifies the TABLESAMPLE clause to do just that (and Postgres supports this since 9.5).

SELECT *
FROM mytable
  TABLESAMPLE BERNOULLI (100 * (1000 / 70000));

The only thing that is a bit annoying is that it only supports percentages. But it's still pretty awesome.

In Postgres you can use SYSTEM or BERNOULLI as an algorithm.

Basically SYSTEM is quick and dirty, while BERNOULLI is more exact. Bernoulli Sampling is a method where each item in the entire population has an equal chance of being included in the sample.

SYSTEM will take a sample of storage pages instead (a page includes multiple items). It will go through each table storage page and decide whether to include the whole page or not.

Your result of a TABLESAMPLE SYSTEM might look like this:

 something_id 
--------------
        29766
        29767
        29768
        29769
        29770
        94348
        94349
        94350
        94352
        94353

An additional neat feature is the REPEATABLE option. You can set a seed value for the pseudo-random function with it. So when running the query with the same seed twice you're guaranteed to get the same result (hence the results are "repeatable").

SELECT *
FROM mytable
  TABLESAMPLE BERNOULLI (1) REPEATABLE (42);

This can also be very useful if you want to get an idea of what kind of data is stored in a table. Instead of doing LIMIT 1 and getting some bias in the data you're reading through, you can simply use a TABLESAMPLE.

SELECT *
FROM mytable
  TABLESAMPLE SYSTEM (0.1);