We tend to use three different languages. In any project, the language is very often determined by the Client's pre-existing environment. Where there is a choice, it is important to understand the strengths and weaknesses of the languages available. Before getting into details, let's consider the famous "Sapir-Whorf" hypothesis (also known as the theory of linguistic relativity). This states:
The grammatical and verbal structure of a language influences how the speaker perceives the world.
In the realm of spoken languages, this is a controversial hypothesis. Programming languages are much narrower in scope and application, and here it is definitely the case that some problems are more easily expressed in one language than another. Picking the language that best matches the problem will make the development process smoother and more efficient.
As a case study, the following looks at some of the operations commonly performed in SQL and their Pandas equivalents. The differences illustrate the strengths and weaknesses of each. A few things to note from the get-go. SQL is, at its core, set-based; the concept of ordering of data within a table was supplied by later additions to the language. By contrast, a Pandas array is intrinsically ordered: rows can be accessed via their position in the array. Both Pandas arrays and SQL tables can be indexed, but an SQL index relates to one or more columns in a table, while in Pandas the index is a set of data distinct from the contents of the array. Different methods are therefore employed in Pandas depending on whether we are querying using the index or using a data column.
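For concreteness, the Pandas snippets that follow can be run against small illustrative frames such as these (the column names a, b, c and the values are purely an assumption for demonstration; where an example needs something different, such as a datetime index or a row-number column, that is noted alongside it):
pd: import pandas as pd
x = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [10, 20, 20, 30], 'c': [0.1, 0.2, 0.3, 0.4]}) #illustrative data only
y = pd.DataFrame({'a': [2, 3, 4], 'b': [20, 30, 40], 'c': [0.2, 0.3, 0.5]})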
selecting a subset of the columns
sql: SELECT a,b FROM x;
pd: x[['a','b']]
selecting columns by exclusion - useful for wide tables; not available in most SQL implementations apart from recent versions of Teradata.
sql: SELECT * FROM ANTISELECT (ON x USING EXCLUDE ('c')) AS x1;
pd: x.drop('c', axis=1)
sql: SELECT a,b FROM x ORDER BY a ASC, b ASC;
pd: x.sort_values(by=['a','b'], ascending=True)
sql: SELECT * FROM x WHERE a>5;
pd: x[x.a>5] #using a data field
x[x.index>5] #using the index
sql: SELECT a, SUM(b) FROM x GROUP BY a;
pd: x.groupby('a')[['b']].sum().reset_index()
reset_index is optional; it causes the grouping variable to be shown as a column in the output array rather than the default, where it appears as the index.
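For example, with the illustrative frame above:
pd: x.groupby('a')[['b']].sum() #the grouping variable 'a' becomes the index
x.groupby('a')[['b']].sum().reset_index() #'a' restored as an ordinary column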
In SQL there is a range of aggregation functions available (e.g. MIN, MAX, STD); the exact list depends on the platform. Typically it is not straightforward to implement a custom aggregation function, if it is possible at all: the function is created in another context and then imported, and both the creator and the user require sufficient privileges to create/use the new function. By contrast, this is all very easy in Pandas.
sql: SELECT a, F(b) FROM x GROUP BY a;
pd: x.groupby('a')['b'].agg(lambda s: s.max() - s.min())
x.groupby('a')['b'].agg(scipy.stats.kurtosis) #requires import scipy.stats
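A named function works just as well; the interquartile-range helper below is purely illustrative:
pd: def iqr(s): return s.quantile(0.75) - s.quantile(0.25) #any Python function of a Series will do
x.groupby('a')['b'].agg(iqr)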
Filtering rows after grouping: in SQL we use the HAVING clause; in Pandas we chain operations - grouping followed by filtering. If we want to do this all in one command, there is the problem that we have no explicit name for the intermediate result following the grouping. For this reason we use the query method.
sql: SELECT a, SUM(b) FROM x GROUP BY a HAVING SUM(b)>10;
pd: x.groupby('a')[['b']].sum().reset_index().query('b>10')
SQL has four set operations - UNION ALL, UNION, INTERSECT and EXCEPT. For these, the entire row is treated as an element within a set. The SQL set operations match up columns according to their position in the source tables; if columns in matching positions are not of compatible data types the operation will fail. By contrast, Pandas aligns columns based on name, adding NaNs where no name match is found.
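A minimal sketch of the name-based alignment, using a hypothetical frame z that shares only column a with x:
pd: z = pd.DataFrame({'a': [9], 'd': [99]}) #hypothetical frame for illustration
pd.concat([x, z], axis=0) #columns b, c and d are padded with NaN where no name match is found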
Union All
sql: SELECT * FROM x UNION ALL SELECT * FROM y;
pd: pd.concat([x, y], axis=0)
Union
sql: SELECT * FROM x UNION SELECT * FROM y;
pd: pd.concat([x, y], axis=0).drop_duplicates()
Intersect
sql: SELECT * FROM x INTERSECT SELECT * FROM y;
pd: x[x.apply(tuple, axis=1).isin(y.apply(tuple, axis=1))] #row-wise membership test, independent of the index
Except
Pandas has no one-step equivalent; instead we test each row of x for membership in y and keep only the rows with no match.
sql: SELECT * FROM x EXCEPT SELECT * FROM y;
pd: x[~x.apply(tuple, axis=1).isin(y.apply(tuple, axis=1))]
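An alternative sketch uses merge with its indicator argument, which tags each row according to where it was found:
pd: (pd.merge(x, y, how='left', indicator=True) #merge on all shared column names
.query('_merge == "left_only"') #keep rows of x with no match in y
.drop(columns='_merge')
.drop_duplicates()) #EXCEPT also removes duplicates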
Joins represent set operations applied to the fields selected for joining. An "inner join" corresponds to the intersection of the join fields, an "outer join" corresponds to a union of the join fields.
Inner Join
Different SQL implementations vary in how they handle name clashes on non-join columns; Pandas appends the suffixes '_x' and '_y' to clashing column names (configurable via the suffixes argument). (NB the syntax for left and right joins is similar.) Duplicate values in the join fields lead to a cartesian product of the matching rows in SQL, and Pandas merge behaves the same way, performing a many-to-many join.
sql: SELECT * FROM x INNER JOIN y ON x.a=y.a AND x.b=y.b;
pd: pd.merge(x,y, how='inner', on=['a', 'b'])
pd.merge(x,y, how='inner', left_index=True, right_index=True)
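To see the duplicate-key behaviour, merging two hypothetical frames that each repeat a key value returns every pairing of the matching rows:
pd: left = pd.DataFrame({'a': [1, 1], 'b': ['p', 'q']}) #hypothetical frames for illustration
right = pd.DataFrame({'a': [1, 1], 'c': ['r', 's']})
pd.merge(left, right, on='a') #four rows - the cartesian product of the matches on each side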
Outer Join
Note that Pandas will coalesce the join fields into a single column while SQL will output the fields from both tables.
sql: SELECT * FROM x FULL OUTER JOIN y ON x.a=y.a;
pd: pd.merge(x,y, how='outer', on='a')
pd.concat((x,y), axis=1)
Outer Join followed by Arithmetic Operations
sql: SELECT COALESCE(x.a, y.a), x.b+y.b FROM x FULL OUTER JOIN y ON x.a=y.a;
pd: x+y
x.add(y, fill_value=0) #replace nans with zero to prevent propagation
Pandas has the simplest syntax in cases where the join data is in the index: Pandas automatically aligns rows with matching index values (NB numpy is not the same - it aligns on position). The SQL SELECT extends simply when it comes to multiple heterogeneous calculations. In Pandas we need to fall back to using merge and then calculate using the resulting dataframe (dropping columns from the final output as necessary). Doing everything in one statement means using assign, which raises the problem of how to identify the columns in the intermediate result: although we know the names (b_x, b_y etc.) we cannot use them directly. The solution is an inline lambda function definition (combine_first is the equivalent of COALESCE).
sql: SELECT COALESCE(x.a, y.a), x.b+y.b, COALESCE(x.c, y.c) FROM x FULL OUTER JOIN y ON x.a=y.a;
pd: pd.merge(x,y, how='outer', on='a').assign(
new1 = lambda df: df.b_x + df.b_y,
new2 = lambda df: df.c_x.combine_first(df.c_y))
Cartesian Join
The Pandas syntax differs depending on the version
sql: SELECT * FROM x, y;
SELECT * FROM x CROSS JOIN y;
pd: pd.merge(x,y, how='cross') #pandas 1.2 or later
x.assign(joincol=1).merge(y.assign(joincol=1), on='joincol').drop('joincol', axis=1) #older pandas
Multiple Inner Joins With Common Indexing Field
Pandas has a slight advantage here: concat can take any number of frames, while SQL must laboriously repeat the join for each new table (although bear in mind that the INNER JOIN clause is more flexible).
sql: SELECT * FROM x INNER JOIN y on x.a=y.a INNER JOIN z on x.a=z.a;
pd: pd.concat((x,y,z), axis=1, join='inner')
Join on an Inequality
Easy to do in SQL. There is no direct syntax in Pandas - the best approach is to cartesian-join to get all possible matches and then pare the result down with a filter.
sql: SELECT * FROM x INNER JOIN y ON x.a<=y.a
pd: pd.merge(x,y, how='cross').query('a_x<=a_y')
Note that Pandas transform loses the grouping columns, so we need to assign the result back to the original dataframe if we want all the data together. Note also that transform has the ability to 'broadcast' back to the shape of the input group (as here) but can do a lot more.
sql: SELECT a, b, SUM(c) OVER (PARTITION BY a, b) AS c_subtotal FROM x;
pd: x.assign(c_subtotal = x.groupby(['a','b'])['c'].transform('sum'))
sql: SELECT a, b, RANK() OVER (PARTITION BY a,b ORDER BY c) as rnk FROM x;
pd: x.assign(rnk = x.groupby(['a','b'])['c'].rank(method='max'))
By default, "rank" in pandas gives half values to ties. The max method instead rounds tie-ranks up. Pandas has a range of built-ins covering operations needing window functions in sql. "Expanding" is equivalent to "Rows between unbounded preceding and current row". "Rolling" is equivalent to rows between n preceding and n following. To get an asymmetric window would need rolling combined with shift. Note, if a date index is defined, the Pandas functions will work based on the dates, and will therefore diverge from sql if there are gaps in the date series
pd: x['c'].shift(1)
x['c'].cumsum()
x['c'].expanding().mean()
x['c'].rolling(52).mean()
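As a sketch of the rolling-plus-shift idea, a leading window covering the current row and the two following rows can be built from a trailing window shifted back (edge rows differ: Pandas returns NaN where the full window is unavailable):
pd: x['c'].rolling(3).mean().shift(-2) #ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING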
It is important to note that Pandas requires the array to be sorted (usually, but not necessarily, this will be the case when dealing with time series data). If the data is not sorted, a sort can be embedded but care must be taken since the functions should operate on the intermediate sorted result not the original array. In the following, "sort_values" returns a copy of the array, "cumsum" operates on this sorted version but "assign" uses the index to find the right row in the original. Aside from the extra code complexity, note that creating intermediate copies of the data is itself a costly operation. It may therefore make sense to carry out a one-off sort of the data or ensure the data is loaded in sorted order.
pd: x.assign(cumulative = x.sort_values(by=['a','b'], ascending=True)['c'].cumsum())
x['c'].cumsum() #if already sorted
Random sampling: the SQL syntax varies depending on the implementation.
sql: SELECT TOP 5 * FROM x; --not true sampling, could return anything
SELECT * FROM x LIMIT 5; --ditto
SELECT * FROM x SAMPLE 5;
SELECT * FROM x TABLESAMPLE (5 ROWS);
SELECT * FROM x WHERE RAND()<0.10; --10% sample, doesn't work in all implementations
pd: x.sample(5, axis=0, replace=False)
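To mirror the 10% RAND() filter, sample also accepts a fraction, and random_state makes the draw repeatable (it returns an exact fraction of rows, whereas the RAND() filter returns a randomly varying count):
pd: x.sample(frac=0.10, random_state=0)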
Resampling a time series onto a regular calendar: there is no specific built-in SQL functionality; instead we need to join to a calendar table having the appropriate granularity and date range.
sql: SELECT x.calendar_dt, SUM(y.a)
FROM sys.calendar x LEFT JOIN y ON x.calendar_dt = y.dt
GROUP BY x.calendar_dt;
pd: x.resample('D').sum() #requires x to have a datetime index
SQL starts to run out of steam when it comes to pivot tables, and there are various add-ons to the core language. The first query below shows that extracting the data itself is easy. However, making a cross-tabulated table requires tedious code and, importantly, needs prior knowledge of the range of values held by the categorical column.
sql: SELECT a, b, SUM(c) FROM x GROUP BY a, b; --right data but in the wrong format
SELECT a,
SUM(CASE WHEN b=1 THEN c ELSE 0 END) AS b1,
SUM(CASE WHEN b=2 THEN c ELSE 0 END) AS b2,
...etc etc
FROM x
GROUP BY a;
Pivot-table-specific syntax from SQL Server - this still needs hardcoding of the column values unless dynamic SQL is employed.
sql: SELECT a, [1], [2], [3]
FROM (SELECT a, b, c FROM x) AS source_table
PIVOT (SUM(c) FOR b IN ([1],[2],[3])) AS pivot_table;
Teradata is a bit better
sql: SELECT * FROM x
PIVOT (SUM(c) FOR b IN (SELECT DISTINCT b FROM x)) AS pivot_table;
Pandas wins hands down on this one
pd: x.pivot_table(index='a', columns='b', values='c', aggfunc='sum', fill_value=0)
Matrix Multiplication
A requirement for successful matrix multiplication is that the number of columns in the first array (x) must equal the number of rows in the second array (y). In SQL we need to convert the arrays to and from a relational format.
sql: WITH x1 AS
(
--convert first matrix to 'tall' or stacked format
SELECT row, 1 AS col, a AS val FROM x
UNION ALL
SELECT row, 2 AS col, b AS val FROM x
),
y1 AS
(
--convert second matrix to 'tall' or stacked format
SELECT row, 1 AS col, a AS val FROM y
UNION ALL
SELECT row, 2 AS col, b AS val FROM y
),
tall AS
(
--results of the multiplication in tall format
SELECT x1.row, y1.col, SUM(x1.val * y1.val) AS val
FROM x1
INNER JOIN y1 ON x1.col = y1.row
GROUP BY x1.row, y1.col
)
--the results need pivoting to convert from tall format back to a matrix
SELECT * FROM tall
PIVOT (SUM(val) FOR col IN (SELECT DISTINCT col FROM tall)) AS pivot_table;
Pandas is much more succinct
pd: x.dot(y)
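A brief aside, not part of the SQL comparison: DataFrame.dot aligns on labels, so the columns of x must match the index of y. To multiply purely by position, drop down to the underlying numpy arrays:
pd: x.values @ y.values #positional matrix product, ignoring the labels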
Conclusions
No surprises - relational operations are a bit more easily constructed and read in SQL. Operations requiring row ordering are easiest in Pandas. Array-based operations are also easiest in Pandas, reflecting its inheritance from NumPy.