Postgresql - How to get the First N values of multiple columns?

David542
April 11, 2024
237 views
1 vote
2 Answers

Let’s say I have the following table:

WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost
) SELECT * FROM tbl

How would I get N values for each column? For example, to display a sample of values without having to run a query for each column? That is, I’d like to grab them all in one-go. So far I have something like:

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
) 
SELECT 
    ARRAY_AGG(DISTINCT id LIMIT 2),
    ARRAY_AGG(DISTINCT product LIMIT 2),
    ARRAY_AGG(DISTINCT cost LIMIT 2)
FROM tbl

This works, but it seems very inefficient (I believe the same as running a query for each column). What is a better way to do this?

Or, to generalize what I think to be a poor approach but applicable outside BQ:

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
)  
select 'id' as field, array(select distinct cast(id as string) from tbl limit 2) as values union all
select 'product', array(select distinct cast(product as string) from tbl limit 2) union all
select 'cost', array(select distinct cast(cost as string) from tbl limit 2);

Answers

- ErwinBrandstetter
- April 11, 2024 at 6:45 am
- 0 votes
0
Your question leaves room for interpretation. You mention "first", but don’t define it. Your query has DISTINCT without mentioning anything about that earlier. Your sample shows neither null values nor duplicates, it’s unclear how to deal with those.

Using a subquery in Postgres, this runs a single, very cheap sequential scan and stops at the small LIMIT:
```
SELECT array_agg(id      ORDER BY id     ) AS ids
     , array_agg(product ORDER BY product) AS products
     , array_agg(cost    ORDER BY cost   ) AS costs
FROM  (
   SELECT id, product, cost 
   FROM   tbl
   -- no ORDER BY, take arbitrary rows cheaply
   LIMIT  2
   ) sub;
```
A more representative sample can be had with the "ordered-set aggregate function" percentile_disc(). Like:
```
SELECT percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY id     ) AS pctl_id
     , percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY product) AS pctl_product   
     , percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY cost   ) AS pctl_cost
FROM   tbl;
```
This variant of percentile_disc() takes float8[] – an array of percentages – and picks a column value at each given percentage within the defined sort order. Rather sophisticated. You can pick any number of column values from any position in the sort order.

This way, you can chose where in the sort order and how many values to pick from each column, while still scanning the table once.
Or, for big tables, based on a small semi-random sample to make it cheaper (while less representative):
```
SELECT percentile_disc('{0,.5,1}'::float[]) WITHIN GROUP (ORDER BY id     ) AS pctl_id
     , ...
FROM   tbl  TABLESAMPLE SYSTEM (10);
```
fiddle

percentile_disc() ignores null values, array_agg() includes them.
Duplicates get no special treatment (yet) in any of these. Any combination of these features (and more) is possible to optimize for performance, randomness, uniqueness, validity, …
You just need to define exactly what you need.

See:
- Fast way to discover the row count of a table in PostgreSQL
- Can someone give examples of use case of the ordered-set functions in PostgreSQL 9.4?
BigQuery seems to only provide the variant of percentile_disc() getting a single value.
Login or Signup to reply.

- MikhailBerlyant
- April 11, 2024 at 7:28 am
- 0 votes
0
Consider below as an option (BigQuery) – fast and cheap and have no dependency on number of columns!!
```
WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost
) 
SELECT name, quantiles AS num_values, ARRAY(SELECT VALUE FROM t.top_values) AS string_values
FROM ML.DESCRIBE_DATA(
  (SELECT * FROM tbl),
  STRUCT(1 AS num_quantiles, 2 AS top_k)
) t;
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – How to get the First N values of multiple columns?

Answers