
Adding SQL functionality to pandas similar to Spark SQL

See original GitHub issue

I would like to add functionality to pandas so that data frames can be queried like database tables, similar to the way they can be in Spark SQL.

I think it should work in a similar fashion.

A table can be registered using register_temp_table(dataframe, table_name).

Then, using pandas.query("select * from table_name"), you can query that data frame, or any other registered data frame, with standard SQL syntax.
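For concreteness, here is a minimal, runnable sketch of the general shape being proposed. Neither register_temp_table nor a top-level query exists in pandas today, and this is not the author’s implementation; a real SQL parser (the author uses Lark) would replace the toy string handling:

import pandas as pd

_TABLES = {}  # table_name -> DataFrame registry

def register_temp_table(dataframe, table_name):
    # Make a DataFrame addressable by name in later SQL queries.
    _TABLES[table_name] = dataframe

def query(sql):
    # Toy dispatcher: handles only the form "select * from <table_name>".
    table_name = sql.strip().split()[-1]
    return _TABLES[table_name].copy()

df = pd.DataFrame({'a': [1, 2, 3]})
register_temp_table(df, 'some_table')
print(query('select * from some_table'))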

I’ve already implemented the entire thing, but I was told to open an issue for it.

Also, I’m aware that there is a package called pandasql, but that package actually puts the data frame into a SQLite database, as opposed to querying the data frame directly and translating the SQL into pandas methods that are then applied to the data frame.

Motivation: The motivation for this enhancement is to make pandas more accessible to users who may not be as technical, and to ease the transition of legacy code in systems like SAS that already have SQL embedded in their programs. I’ll supply a context-free grammar in my documentation to show exactly what this system can handle, but it can handle essentially any traditional SQL SELECT statement, including subqueries, joins, WHERE clauses, GROUP BY clauses, any aggregate function already supported by pandas, and LIMIT and ORDER BY clauses. It also supports the rank and dense_rank window functions. It can’t do things that SQL wouldn’t normally do, like cross tabulation, and you can’t use a user-defined function in it, although I think that could be a good add-on.

Datatypes: The interface supports all pandas dtypes, so casts are written against the pandas dtype names; for example, casting to an integer would currently be cast(some_number as int64), and casting to a generic object column would be cast(some_int as object). I’ve played around with the idea of varchar, char, bigint, and smallint, but I think those would be misleading, since pandas doesn’t currently support them as dtypes.
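In plain pandas, the casts that the proposed syntax describes correspond to astype calls with the same dtype names (the values below are made up for illustration):

import pandas as pd

s = pd.Series(['1', '2', '3'])
as_int = s.astype('int64')        # what cast(some_number as int64) would map to
as_obj = as_int.astype('object')  # what cast(some_int as object) would map to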

Errors: Currently, the exceptions thrown by this API come solely from selecting from an unregistered table or from submitting an improperly written SQL query, neither of which you would want to silence, so there is really only one error mode.

API choices: The reason I made register_temp_table a top-level part of the API was to avoid attaching a method to DataFrame, although if others think it would be better as a method, I would change it to that form (DataFrame.register_temp_table(table_name)). The reason pandas.query is a top-level function is that it’s relational in structure: you can select from multiple tables and join them, so it wouldn’t make sense for it to live on a single DataFrame. Its only similarity to the DataFrame.query method is the name. DataFrame.query is just an alternate way of expressing things like DataFrame[some_condition], whereas my query encompasses a large amount of the pandas API.

Built in: I have two reasons why I think this would be better built in. The first is that the target audience for this is less technical pandas users. Part of making this API easier to use is lessening the burden of researching code and learning how Python works, so going looking for an external package may be hard for them to begin with, and they would also need to know to look for one. My second reason is that, from using what I’ve built, I’ve found pandas a lot easier to use just as a developer. Suppose we have a DataFrame, registered as some_table, with one column called a. It goes from this code:

dataframe['name_1'] = dataframe['a'] - 1
dataframe['name_2'] = dataframe['a'] + 1
dataframe = dataframe[dataframe['name_1'] == dataframe['name_2']]
dataframe.drop(columns=['a'], inplace=True)

To this code: pd.query("select a - 1 as name_1, a + 1 as name_2 from some_table where name_1 = name_2")

Also, although I did implement register_temp_table as a top-level function, it may serve best as a method on DataFrame, so that’s another thing to consider.

I can’t really offer much justification for the Lark part, other than that it seemed like the best parsing tool for what I was making.

I apologize for the style and such; I’ll fix all of that before I’m done. I implemented this outside of pandas first, so that’s why there are so many style and documentation discrepancies.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 25 (23 by maintainers)

Top GitHub Comments

2 reactions
devin-petersohn commented, Dec 26, 2019

Thanks for the ping @datapythonista, I will give some thoughts here.

Dataframes are not relational tables, which means that the SQL equivalents of pandas API calls will be slightly different semantically. Spark does not have a “true” dataframe in the pure sense like pandas or R, but that is a different discussion. It is worth mentioning here because there was a comparison to Spark in the thread.

First, there is the question of optimization. If you are simply translating SQL into pandas, there will be no optimization, because pandas does not have a query optimizer. The purpose of writing SQL is to give the system the entire query upfront so it can find the optimal execution plan. You will have to either (1) accept lower performance for suboptimally written SQL queries, or (2) write your own query optimizer. There are some simple query-rewriting steps that can be implemented with relatively little engineering overhead; one is sketched below.
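As a concrete example of such a rewrite (a hand-rolled predicate pushdown; the frames and columns below are made up), filtering before a join returns the same rows as filtering after it, but feeds far less data into the join:

import pandas as pd

orders = pd.DataFrame({'cust_id': [1, 2, 3, 4], 'amount': [10, 200, 30, 400]})
customers = pd.DataFrame({'cust_id': [1, 2, 3, 4], 'region': ['N', 'N', 'S', 'S']})

# Naive plan: join everything first, filter afterwards.
naive = orders.merge(customers, on='cust_id')
naive = naive[naive['amount'] > 100]

# Rewritten plan: push the filter below the join, so only matching rows are joined.
pushed = orders[orders['amount'] > 100].merge(customers, on='cust_id')

assert naive.reset_index(drop=True).equals(pushed.reset_index(drop=True))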

Second, SQL has different semantics than dataframes. For example, T1 JOIN T2 != T2 JOIN T1. This is because in a dataframe there is an implicit order that the relational data model does not have.
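A small pandas illustration of that ordering difference (the frames are made up): an inner merge preserves the order of the left frame’s keys, so swapping the operands reorders the rows (and the columns) even though the same rows match:

import pandas as pd

t1 = pd.DataFrame({'k': [2, 1], 'x': ['b', 'a']})
t2 = pd.DataFrame({'k': [1, 2], 'y': ['p', 'q']})

print(t1.merge(t2, on='k'))  # rows follow t1's key order: 2, then 1
print(t2.merge(t1, on='k'))  # rows follow t2's key order: 1, then 2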

Modin is an academic project, so we are taking a more principled approach to solving some of these problems. We started with treating all APIs as a separate layer and querying the underlying, drilled-down version of the dataframe API, and putting the optimizations at that layer. I’m familiar with the differences between relational tables and dataframes because we are formalizing them.

I think the biggest reason @datapythonista and others would like it not to be in pandas is because there is the question of who will maintain it. If they pull it in, they are agreeing to maintain it themselves, which is a big ask. This is why it should start as an outside project and potentially be brought in later. That said, a lot of people would like this so it would probably generate a large amount of interest if you are willing to maintain it longer term, especially if you made it possible to use with the other projects @mrocklin mentioned.

1 reaction
wesm commented, Dec 26, 2019

Ideally I would like to see a pandas-independent SQL parser that can generate a reasonable logical query plan (similar to what we’ve done in https://github.com/ibis-project/ibis, though we have never tried parsing SQL; instead we do the inverse, modeling SQL semantics with a pandas-like DSL) that can be mapped to data frame operations. To save you some time, you could consider creating a Cython binding for Google’s ZetaSQL project (https://github.com/google/zetasql), which powers BigQuery (I haven’t investigated how easy it is to use, though). Then you aren’t having to maintain your own parser and SQL query-plan implementation yourself.
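As a very rough, engine-agnostic sketch of what “logical query plan mapped to data frame operations” could look like (this is not ibis or ZetaSQL code; the node names are invented for illustration), each plan node knows how to evaluate itself against pandas:

import pandas as pd

# Toy logical-plan nodes; a SQL parser would build this tree from query text.
class Scan:
    def __init__(self, df):
        self.df = df
    def execute(self):
        return self.df

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate
    def execute(self):
        df = self.child.execute()
        return df[self.predicate(df)]

class Project:
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns
    def execute(self):
        return self.child.execute()[self.columns]

# "select x from t where y > 1", hand-built here instead of parsed.
t = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 2, 5]})
plan = Project(Filter(Scan(t), lambda df: df['y'] > 1), ['x'])
print(plan.execute())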

Apache Calcite is another project that provides this functionality, but it’s written in Java, and while some native-code systems (Apache Impala and OmniSciDB being notable examples) have implemented bindings to use it, a JNI dependency might be unpalatable.

It would be sad IMHO if there were a SQL->Result implementation that is tightly coupled to a single dataframe-like project.

cc @emkornfield who might know more about ZetaSQL and can comment
