Python Dataframe to SQL Query

Question

I have developed a Python script that reads a CSV file which is a result of a SQL query (just a select * from table) and I perform some transformations and calculations on that dataframe.

I get the dataframe using the following Python commands:

result=csv_df.sort_values(by=['column1','column2','column3'],ascending=True)
result=result.drop_duplicates(['column1','column2'])

Now I need to get the same table using a SQL Query. I have tried the following in T-SQL but I have not been succesful.

select * from data
    where column1 IN
    (select distinct column1,column2 from data)
 and 
    where column2 IN
    (select distinct column1,column2 from data)
    order by column1,column2;

I am new to SQL syntax, can someone help me with the query?

What I am trying to do is delete all the duplicated rows from the combination of column1 and column2.

In Python the reason I include column3 is because it has NULL values that I need to discard.

After this should I create a view to keep on performing calculations?


Show source
| python   | sql-server   | sql   | pandas   2017-01-03 09:01 3 Answers

Answers ( 3 )

  1. 2017-01-03 13:01

    If I understood your question correctly, you can do it using ROW_NUMBER() function:

    with VirtTab as (
        select
          t.*,
          row_number()
            over(partition by column1, column2 order by column1, column2) as rn
        from data t
    )
    select * from VirtTab
    where rn = 1
    order by column1, column2;
    
  2. 2017-01-03 15:01

    From what I understand you require all records ordered by column1, column2 and column3:

    Select * from data order by column1,column2,column3
    

    Now, on top of this you want to remove duplicate rows in columns participantObjectId and slipObjectId. First partition columns on the basis of participantObjectId and slipObjectId. The query below wraps on top of the above query and adds another field id which provides unique values for each row of the data.

    select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id
    from (select * from data order by column1,column2,column3)
    

    On top of this we add another select statement with an added condition to choose only those rows with id equal to 1.

    select * from
    (select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id from 
    (Select * from data order by column1,column2,column3
    )) where id=1;
    
  3. 2017-01-04 02:01

    Assuming a unique ID in table, consider taking the record with lowest ID of matching column1 and column2 pairs:

    SELECT * FROM data AS main
    WHERE main.ID IN
        (SELECT sub.MinID FROM
           (SELECT column1, column2, Min(ID) As MinID
            FROM data
            GROUP BY column1, column2) AS sub)
    ORDER BY main.column1, main.column2;
    

    Alternatively, with JOIN:

    SELECT main.* FROM data AS main
    INNER JOIN 
        (SELECT column1, column2, Min(ID) As MinID
         FROM data
         GROUP BY column1, column2) AS sub
    ON main.ID = sub.MinID
    ORDER BY main.column1, main.column2;
    

    Even still, with EXISTS:

    SELECT main.* FROM data AS main
    WHERE EXISTS
      (SELECT 1 FROM 
          (SELECT column1, column2, Min(ID) As MinID
           FROM data
           GROUP BY column1, column2) sub
       WHERE main.ID = sub.MinID)
    ORDER BY main.column1, main.column2;
    

    And a non-Window function query using correlated count subquery (for potential compatibility with MySQL, SQLite, and MS Access). This version leaves out records if any of two columns are NULL:

    SELECT * FROM 
      (SELECT *, 
          (SELECT Count(*) FROM data sub
           WHERE sub.ID <= data.ID
           AND sub.column1 = data.column1
           AND sub.column2 = data.column2) AS rn
       FROM data) AS main
    WHERE main.rn = 1
    
◀ Go back