Recently I've been working a bit with Postgres. I wanted to offload as much work as possible to the database. I had never seriously learned SQL before, so this was a first for me.
One of the things I wanted to do in SQL was to take a sample of 1'000 entities out of a big table with 70'000 rows.
SQL specifies the TABLESAMPLE clause to do just that (and Postgres has supported it since 9.5).
SELECT *
FROM mytable
TABLESAMPLE BERNOULLI (100.0 * 1000 / 70000); -- ≈ 1.43 %; 100.0 avoids integer division (which would yield 0)
The only thing that is a bit annoying is that it only supports percentages. But it's still pretty awesome.
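If you need (at most) an exact number of rows rather than a percentage, one common workaround is to sample a slightly too large percentage and cap the result with LIMIT. A sketch, assuming the mytable / 70'000-row setup from above:

```sql
-- Oversample (2 % ≈ 1'400 of 70'000 rows), then cap at 1'000.
-- Caveat: without an ORDER BY random() before the LIMIT, this
-- slightly favors rows from earlier storage pages, since rows
-- are scanned in physical order.
SELECT *
FROM mytable
TABLESAMPLE BERNOULLI (2)
LIMIT 1000;
```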
In Postgres you can use SYSTEM or BERNOULLI as the sampling method. Basically, SYSTEM is quick and dirty, while BERNOULLI is more exact. Bernoulli sampling is a method where each item in the entire population has an equal chance of being included in the sample. SYSTEM instead takes a sample of storage pages (a page contains multiple rows): it goes through each of the table's storage pages and decides whether to include the whole page or not.
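To get a feel for the difference, you can run both methods side by side (again assuming the mytable setup from above). SYSTEM's row count tends to fluctuate more between runs, because it includes or excludes entire pages at a time:

```sql
-- Roughly 1 % of the rows either way, but with different variance.
SELECT count(*) FROM mytable TABLESAMPLE SYSTEM (1);
SELECT count(*) FROM mytable TABLESAMPLE BERNOULLI (1);
```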
The result of a TABLESAMPLE SYSTEM query might look like this (note the runs of consecutive IDs: whole pages were included):
something_id
--------------
29766
29767
29768
29769
29770
94348
94349
94350
94352
94353
An additional neat feature is the REPEATABLE option, which lets you set a seed value for the pseudo-random function. When running the query with the same seed twice, you're guaranteed to get the same result (hence the results are "repeatable").
SELECT *
FROM mytable
TABLESAMPLE BERNOULLI (1) REPEATABLE (42);
This can also be very useful if you want to get an idea of what kind of data is stored in a table. Instead of doing a LIMIT 1 and getting some bias in the data you're reading through, you can simply use a TABLESAMPLE.
SELECT *
FROM mytable
TABLESAMPLE SYSTEM (0.1);