Consider the following two Spark dataframes:
df1.show()
+----+------+-------+
|id_a|time_a|value_a|
+----+------+-------+
| 1| 1| CA|
| 1| 2| CA|
| 2| 1| TX|
| 3| 5| NE|
| 4| 6| WA|
+----+------+-------+
df2.show()
+----+------+-----------+
|id_b|time_b| value_b|
+----+------+-----------+
| 1| 1| San Jose|
| 2| 1|Los Angeles|
| 2| 2| Austin|
+----+------+-----------+
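For reference, the two dataframes can be reproduced with something along these lines (a minimal sketch, assuming a local SparkSession; the column names match the tables shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Build df1 and df2 from the rows shown above
df1 = spark.createDataFrame(
    [(1, 1, "CA"), (1, 2, "CA"), (2, 1, "TX"), (3, 5, "NE"), (4, 6, "WA")],
    ["id_a", "time_a", "value_a"],
)
df2 = spark.createDataFrame(
    [(1, 1, "San Jose"), (2, 1, "Los Angeles"), (2, 2, "Austin")],
    ["id_b", "time_b", "value_b"],
)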
Now assume you want to join the two dataframes using both the id columns and the time columns. This can easily be done in PySpark (note that the columns are named id_a/time_a and id_b/time_b, and that the join type is passed via the how parameter):
df = df1.join(df2, (df1.id_a == df2.id_b) & (df1.time_a == df2.time_b), how="inner")
df.show()
+----+------+-------+----+------+-----------+
|id_a|time_a|value_a|id_b|time_b| value_b|
+----+------+-------+----+------+-----------+
| 1| 1| CA| 1| 1| San Jose|
| 2| 1| TX| 2| 1|Los Angeles|
+----+------+-------+----+------+-----------+
Note that the parentheses around the two conditions are absolutely necessary: in Python, & binds more tightly than ==, so without them the expression is not parsed as the conjunction of two equality checks.
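A quick sketch of what goes wrong without them, using the dataframes above:

# Without parentheses, & is evaluated before ==, so Python reads this as a
# chained comparison involving (df2.id_b & df1.time_a). Truth-testing the
# intermediate Column raises a ValueError ("Cannot convert column into bool"):
# df1.join(df2, df1.id_a == df2.id_b & df1.time_a == df2.time_b, how="inner")

# With parentheses, each equality yields a boolean Column first, and the
# two Columns are then combined with &:
cond = (df1.id_a == df2.id_b) & (df1.time_a == df2.time_b)
df = df1.join(df2, cond, how="inner")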