[This post is a draft for Table.co technical blog]
Graphs can be used as proper data structure in many applications from marketing data to social networks. It is amazing how representing data in terms of vertex and edges make the problem easier to solve and scalable. That is why I am supper excited about the Spark Graphframes. It is a graph processing package that sits on top of Spark Dataframe and can be considered as an extension to GraphX (which is based on RDDs). The cool thing about Graphframes is that you can run both graph algorithms and graph queries (pattern matching) within the same framework.
The purpose of this post is to show how how one can use power of pattern matching in Graphframe for recommendation in social networks. For that we use a toy dataset that is inspired by the data at Table.co.
What is our network?
Consider a network with two type of objects: 1- users 2- tables.
Users can follow each other and it is one way connection.
Tables are similar to private chatrooms. They contain both private and public messages. A user can only follow a table which gives him/her access to public messages OR a user can be a member of table which means access to all messages and also privilege to send message to other members and follower of the table.
Let assume we have the following simple network of users and tables:
In this toy graph, we have 6 users and three tables. A lot can be read from this network. For example, Med follows Joe and he is a member of DesignMe table. Bob is a follower of Jerald and he is also followed by Jerald. He follows DesignMe table.
Put it in Graphframes
First lets define this graph in Graphframes. For that we need to first define two dataframes: 1- vertex dataframe 2- edge dataframe.
Vertex dataframe:
Each node in the graph would have an unique id
. All nodes have a type
field to identify the type of node; Is it a user node or table node. Furthermore, all nodes have a name
which is their human readable identity. Finally, there is a extra
filed for each node object which has a json value and it must be interpreted based on the type
of the node. The vertex dataframe for graph in Figure 1 can be found as follows:
all_members = \
sqlc.createDataFrame([
{'id':1, 'name': 'Bob', 'type': 'person', 'extra':None},
{'id':2, 'name': 'Alice', 'type': 'person', 'extra':None},
{'id':3, 'name': 'Med', 'type': 'person', 'extra':None},
{'id':4, 'name': 'Joe', 'type': 'person', 'extra':None},
{'id':5, 'name': 'Jerald', 'type': 'person', 'extra':None},
{'id':6, 'name': 'Derek', 'type': 'person', 'extra':None},
{'id':7, 'name': 'SF photography', 'type': 'table', 'extra':None},
{'id':8, 'name': 'DesineMe', 'type': 'table', 'extra':None},
{'id':9, 'name': 'California Art', 'type': 'table', 'extra':None},
{'id':10, 'name': 'Radius', 'type': 'table', 'extra':None},
])
Edge dataframe:
Nodes in graph can be connected to each other. Based on type of the object, the connection can be different with different information and property. The common denominator for connection properties in all of the connections are 1- source id 2- destination id 3- type of the connection.
As it is mentioned before and it depicted in Figure 1, we have three types of connection in our network.
1- person follows a person (connection
): If one person is following another person
2- person follows a table (follow
): a person only have access to public message of table
3- person is member of table (member
): person can post a message to the table and see all other messages in the table.
Then, the edge information can be described in a dataframe as follows:
all_conections = \
sqlc.createDataFrame([
{'src':1, 'dst': 2, 'type': 'connection'},
{'src':1, 'dst': 5, 'type': 'connection'},
{'src':1, 'dst': 6, 'type': 'connection'},
{'src':1, 'dst': 7, 'type': 'follow'},
{'src':1, 'dst': 8, 'type': 'follow'},
{'src':1, 'dst': 9, 'type': 'follow'},
{'src':2, 'dst': 1, 'type': 'connection'},
{'src':2, 'dst': 6, 'type': 'connection'},
{'src':2, 'dst': 5, 'type': 'connection'},
{'src':2, 'dst': 4, 'type': 'connection'},
{'src':2, 'dst': 3, 'type': 'connection'},
{'src':2, 'dst': 7, 'type': 'follow'},
{'src':2, 'dst': 8, 'type': 'member'},
{'src':2, 'dst': 9, 'type': 'follow'},
{'src':3, 'dst': 2, 'type': 'connection'},
{'src':3, 'dst': 6, 'type': 'connection'},
{'src':3, 'dst': 5, 'type': 'connection'},
{'src':3, 'dst': 9, 'type': 'member'},
{'src':3, 'dst': 8, 'type': 'member'},
{'src':4, 'dst': 2, 'type': 'connection'},
{'src':5, 'dst': 1, 'type': 'connection'},
{'src':5, 'dst': 2, 'type': 'connection'},
{'src':5, 'dst': 3, 'type': 'connection'},
{'src':6, 'dst': 1, 'type': 'connection'},
{'src':6, 'dst': 2, 'type': 'connection'},
{'src':6, 'dst': 3, 'type': 'connection'},
])
Now that we have vertex and edge dataframe it is time to build the graph and make a query.
Query the Graph
Once we have vertices and edges, it will be very easy to build the graph in Graphframes:
g = GraphFrames(all_members, all_conections)
Now that we have the graph, let consider the following recommendation algorithm. We want to recommend person b to follow person a if:
1- they are not connected (neither of them is following the other one).
2- have at least 4 nodes in common. This means they both are connected to at least 4 nodes.
3- at least two of those nodes are table.
4- a is member of the above tables.
Criterias 1 and 2 can easily be done by find
api from Graphframes:
one_hub_connection = g.find("(a)-[ac1]->(c1); (b)-[bc1]->(c1); (a)-[ac2]->(c2); (b)-[bc2]->(c2); (a)-[ac3]->(c3); (b)-[bc3]->(c3); (a)-[ac4]->(c4); (b)-[bc4]->(c4); !(a)-[]->(b)")
In this query, we are asking to find two nodes a and b that are connected to 4 other generic nodes c1, c2, c3, and c4. The last statement, !(a)-[]->(b)
, requires that a and b are not connected.
The last two pieces of requirement is a little bit complex because they require a statefull query. It means we need to look at the type of intermediate nodes and the edges and keep the count of number of member type connection to tables.
Since Graphframes is based on Dataframe, for statefull queries, we need to define a filter that uses sequence operations.
sumTable = lambda cnt, entity, connection_entity: when((entity==‘table’) & (connection_entity==‘member’), cnt+1).otherwise(cnt)
condition = reduce(lambda cnt, e: sumOrg(cnt, col(e[0]).type, col(e[1]).type), [(‘c1’,‘ac1’), (‘c2’,‘ac2’), (‘c3’,‘ac3’), (‘c4’,‘ac4’)], lit(0))
The first line defines a function that holds a counter and it increases the counter if an ci
node is Table and the connection type aci
and bci
is member.
The second line is a sequence filter that can be used in where
api.
res = one_hub_connection.where(condition >= 2 )
+--------------+--------------+
| a| b|
+--------------+--------------+
|[3,Med,person]|[1,Bob,person]|
+--------------+--------------+
So answer to the query is Med and Bob. From the graph in Figure 1, it is clear that
1- They are not connect
2- They both are connected to Jerald, Alice, DeignMe, California Art, Derek
3- DesignMe and California Art are Tables
4- Med is member of California Art and DesignMe (Bob only follows them)