hat tip: join two spark dataframes on multiple columns (pyspark)
Thursday, September 24, 2015
Consider the following two spark dataframes:
df1.show()
+----+------+-------+
|id_a|time_a|value_a|
+----+------+-------+
|   1|     1|     CA|
|   1|     2|     CA|
|   2|     1|     TX|
|   3|     5|     NE|
|   4|     6|     WA|
+----+------+-------+
df2.show()
+----+------+-----------+
|id_b|time_b|    value_b|
+----+------+-----------+
|   1|     1|   San Jose|
|   2|     1|Los Angeles|
|   2|     2|     Austin|
+----+------+-----------+
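
For reference, the two dataframes above can be built like this (a minimal sketch, assuming a SparkSession named spark is already in scope):

# Build the example dataframes with explicit column names.
df1 = spark.createDataFrame(
    [(1, 1, "CA"), (1, 2, "CA"), (2, 1, "TX"), (3, 5, "NE"), (4, 6, "WA")],
    ["id_a", "time_a", "value_a"])
df2 = spark.createDataFrame(
    [(1, 1, "San Jose"), (2, 1, "Los Angeles"), (2, 2, "Austin")],
    ["id_b", "time_b", "value_b"])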
Now assume you want to join the two dataframes using both the id and the time columns. This can easily be done in PySpark:
df = df1.join(df2, (df1.id_a == df2.id_b) & (df1.time_a == df2.time_b), how="inner")
df.show()
+----+------+-------+----+------+-----------+
|id_a|time_a|value_a|id_b|time_b|    value_b|
+----+------+-------+----+------+-----------+
|   1|     1|     CA|   1|     1|   San Jose|
|   2|     1|     TX|   2|     1|Los Angeles|
+----+------+-------+----+------+-----------+
Note that the parentheses around each condition are absolutely necessary: Python's & operator binds more tightly than the comparison operators, so without them the expression would be parsed as df1.id_a == (df2.id_b & df1.time_a) == df2.time_b, which raises an error.
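
If you would rather sidestep the precedence issue entirely, the conditions can also be passed as a list (a sketch, using the same df1 and df2 as above):

# Each list element is an ordinary column expression; Spark ANDs them together.
conditions = [df1.id_a == df2.id_b, df1.time_a == df2.time_b]
df = df1.join(df2, conditions, how="inner")

Because each condition is a separate list element, no extra parentheses are needed.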