Is querying a set number of columns faster for a small table than a large table?

Question

I am designing a database that will include a high score list and need the SQL queries to be fast for the list.

There will be a 1-to-1 relationship between an entry on the high score list and the users, but I don't need all the user info for the list itself.

So, I'm wondering which design will be faster to query.

Setup 1:

2 tables with 1-to-1 relationship.

users table: ID (PK) - name - email - entryID (FK)

entries table: ID (PK) - score - message - userID (FK)

Query:

SELECT score, message
FROM entries
ORDER BY score DESC

Setup 2:

1 table

users table: ID - name - email - score - message

Query:

SELECT score, message
FROM users
ORDER BY score DESC

What I'm wondering is: Does the number of columns in a table affect the speed of a query, when you're selecting a subset of the columns?



3 Answers

  1. 2016-12-23 12:12

    If your table has no indexes, the number of columns in the query does not matter: a sequential scan reads whole blocks containing all columns anyway. The only impact would be on return time, since the amount of data sent to the client differs.

    If you plan to have indexes and the query's column list contains only indexed columns, then it does matter. E.g. with 9.6 you can hit an Index Only Scan (a sketch follows the example below)...

    Ah, yes, it also matters for columns with TOASTed values.

    Short example:

    t=# create table s09 (i int, a text);
    CREATE TABLE
    t=# insert into s09 select generate_series, 'text' from generate_series(1,9999999,1);
    INSERT 0 9999999
    t=# analyze s09;
    ANALYZE
    t=# explain analyze select * from s09;
                                                       QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------
     Seq Scan on s09  (cost=0.00..154053.58 rows=9999858 width=9) (actual time=0.010..1712.339 rows=9999999 loops=1)
     Planning time: 0.046 ms
     Execution time: 2825.514 ms
    (3 rows)
    
    t=# explain analyze select i from s09;
                                                       QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------
     Seq Scan on s09  (cost=0.00..154053.58 rows=9999858 width=4) (actual time=0.010..1828.329 rows=9999999 loops=1)
     Planning time: 0.028 ms
     Execution time: 2937.231 ms
    (3 rows)
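
    For illustration, here is a hedged sketch (not part of the original session; the index name and the expected plan line are assumptions) of how an index-only scan could show up once only the indexed column is selected:

    -- hypothetical index on the integer column
    create index s09_i_idx on s09 (i);
    -- VACUUM sets the visibility map, which index-only scans rely on
    vacuum analyze s09;
    explain select i from s09 order by i limit 10;
    -- the plan should include: Index Only Scan using s09_i_idx on s09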
    
  2. 2016-12-23 12:12

    From the comments I found the answer to this question. It comes down to the disk IO and the cache hit/miss ratio.

    If the table rows are small, more rows are read per disk IO. Therefore, working with the data in memory (selecting the 2 columns and ignoring the others) will be faster, because fewer trips to the disk are required.

    So, even though the amount of data returned is the same, smaller rows mean more rows fit into each disk IO block, so fewer blocks have to be read and you get a slight performance increase.
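
    As a hedged side note (not from the original answer; pg_relation_size and the pg_class columns reltuples and relpages are standard PostgreSQL catalog facilities), one way to quantify this is to check the heap size and rows-per-page of each setup's table after loading the test data below and running ANALYZE:

    -- heap size and approximate rows per 8 kB page for a table
    select pg_size_pretty(pg_relation_size('entries')) as heap_size,
           round(reltuples / greatest(relpages, 1))    as rows_per_page
    from pg_class
    where relname = 'entries';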

    Thanks to JosMac for getting me on track.

    Edit: Adding analyze test cases and results.

    Test case 1, with 2 different tables but only selecting score, message from one table:

    create table users (
        ID int not null primary key,
        name varchar(50) not null,
        email varchar(50) not null,
        sex int,
        age int,
        country varchar(50),
        occupation varchar(50),
        handedness int);
    
    
    create table entries (
        ID int not null primary key,
        score int not null,
        message varchar(140) not null,
        userID int not null references users(ID));
    
    create index entry_scores
    on entries(score);
    
    insert into users
    select  generate_series,
        'name',
        'email@test.com',
        CASE generate_series % 3
            WHEN 0 THEN
                null
            WHEN 1 THEN
                1
            WHEN 2 THEN
                2
        END,
        generate_series,
        'Some country',
        'some occupation',
        CASE generate_series % 3
            WHEN 0 THEN
                null
            WHEN 1 THEN
                1
            WHEN 2 THEN
                2
        END from generate_series(1, 1000000, 1);
    
    insert into entries
    select generate_series, generate_series, 'some message', generate_series from generate_series(1, 1000000, 1);
    
    explain analyze select score, message from entries;
    

    This returned the message:

    Seq Scan on entries  (cost=0.00..9117.72 rows=176472 width=302) (actual time=0.019..393.597 rows=1000000 loops=1)
    

    Test case 2, with one large table but only selecting score, message from it:

    create table users (
        ID int not null primary key,
        name varchar(30) not null,
        email varchar(30) not null,
        sex int,
        age int,
        country varchar(50),
        occupation varchar(50),
        handedness int,
        score int not null,
        message varchar(140) not null);
    
    create index user_scores
    on users(score);
    
    insert into users
    select  generate_series,
        'name',
        'email@test.com',
        CASE generate_series % 3
            WHEN 0 THEN
                null
            WHEN 1 THEN
                1
            WHEN 2 THEN
                2
        END,
        generate_series,
        'Some country',
        'some occupation',
        CASE generate_series % 3
            WHEN 0 THEN
                null
            WHEN 1 THEN
                1
            WHEN 2 THEN
                2
        END,
        generate_series,
        'some message' from generate_series(1, 1000000, 1);
    
    explain analyze select score, message from users;
    

    This returned the message:

    Seq Scan on users  (cost=0.00..15857.46 rows=157146 width=302) (actual time=0.012..485.094 rows=1000000 loops=1)
    

    This shows that setup 1 is faster, presumably because its smaller row size saves disk IO reads.
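
    As a hedged follow-up (not part of the original tests; the LIMIT is an assumption for a top-N list), the actual high-score query from the question can also use the entry_scores index to avoid sorting:

    explain analyze
    select score, message
    from entries
    order by score desc
    limit 10;
    -- with the entry_scores index this should use an Index Scan Backward
    -- rather than a Seq Scan followed by a Sort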

    Edit 2: Added the correct assessment of the analysis.

  3. 2016-12-23 13:12

    In general, if you have two queries Q1 and Q2, where Q1 selects every column that Q2 selects plus at least one more and everything else is identical, then Q2 will clearly perform better than Q1, since Q1:

    • may need more trips to the disk
    • has a larger amount of data to load, because the extra columns' values have to be read for every resulting record
    • could significantly increase load time if Q1 and Q2 are used as subqueries (I am not sure about this one, though; I might be wrong if the RDBMS executes the subquery only once in every case)
    • has to keep the extra column values in memory, which takes time as well as memory
    • has to send all those values stored in memory to the client

    However, the size does not really depend on the number of columns, but on the sum of their average sizes. Performance depends both on total size and on the number of disk trips. In general it is better to read only what is needed, both from a performance point of view and for security reasons.

    In your question I think the first example is inaccurate: in reality you would have a join. If you join on indexed foreign keys, the query should be reasonably quick, but not quicker than querying a single table. However, you might still need to separate the data into several tables if keeping it together would cause inconsistency or redundancy, so normal forms (NF) are something you should apply to your DB.
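
    For example, a hedged sketch (table and column names taken from Setup 1 in the question) of what the join would look like if the list also needed the user's name:

    SELECT u.name, e.score, e.message
    FROM entries e
    JOIN users u ON u.ID = e.userID
    ORDER BY e.score DESC;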
