Using Graphframes for Social Recommendation

Tuesday, May 3, 2016

[This post is a draft for technical blog]

Graphs can be used as proper data structure in many applications from marketing data to social networks. It is amazing how representing data in terms of vertex and edges make the problem easier to solve and scalable. That is why I am supper excited about the Spark Graphframes. It is a graph processing package that sits on top of Spark Dataframe and can be considered as an extension to GraphX (which is based on RDDs). The cool thing about Graphframes is that you can run both graph algorithms and graph queries (pattern matching) within the same framework.

The purpose of this post is to show how how one can use power of pattern matching in Graphframe for recommendation in social networks. For that we use a toy dataset that is inspired by the data at

What is our network?

Consider a network with two type of objects: 1- users 2- tables.

Users can follow each other and it is one way connection.

Tables are similar to private chatrooms. They contain both private and public messages. A user can only follow a table which gives him/her access to public messages OR a user can be a member of table which means access to all messages and also privilege to send message to other members and follower of the table.

Let assume we have the following simple network of users and tables:

In this toy graph, we have 6 users and three tables. A lot can be read from this network. For example, Med follows Joe and he is a member of DesignMe table. Bob is a follower of Jerald and he is also followed by Jerald. He follows DesignMe table.

Put it in Graphframes

First lets define this graph in Graphframes. For that we need to first define two dataframes: 1- vertex dataframe 2- edge dataframe.

Vertex dataframe:

Each node in the graph would have an unique id. All nodes have a type field to identify the type of node; Is it a user node or table node. Furthermore, all nodes have a name which is their human readable identity. Finally, there is a extra filed for each node object which has a json value and it must be interpreted based on the type of the node. The vertex dataframe for graph in Figure 1 can be found as follows:

all_members = \
        {'id':1, 'name': 'Bob', 'type': 'person', 'extra':None},
        {'id':2, 'name': 'Alice', 'type': 'person', 'extra':None},
        {'id':3, 'name': 'Med', 'type': 'person', 'extra':None},
        {'id':4, 'name': 'Joe', 'type': 'person', 'extra':None},
        {'id':5, 'name': 'Jerald', 'type': 'person', 'extra':None},
        {'id':6, 'name': 'Derek', 'type': 'person', 'extra':None},
        {'id':7, 'name': 'SF photography', 'type': 'table', 'extra':None},
        {'id':8, 'name': 'DesineMe', 'type': 'table', 'extra':None},
        {'id':9, 'name': 'California Art', 'type': 'table', 'extra':None},
        {'id':10, 'name': 'Radius', 'type': 'table', 'extra':None},

Edge dataframe:

Nodes in graph can be connected to each other. Based on type of the object, the connection can be different with different information and property. The common denominator for connection properties in all of the connections are 1- source id 2- destination id 3- type of the connection.

As it is mentioned before and it depicted in Figure 1, we have three types of connection in our network.

1- person follows a person (connection): If one person is following another person

2- person follows a table (follow): a person only have access to public message of table

3- person is member of table (member): person can post a message to the table and see all other messages in the table.

Then, the edge information can be described in a dataframe as follows:

all_conections = \
        {'src':1, 'dst': 2, 'type': 'connection'},
        {'src':1, 'dst': 5, 'type': 'connection'},
        {'src':1, 'dst': 6, 'type': 'connection'},
        {'src':1, 'dst': 7, 'type': 'follow'},
        {'src':1, 'dst': 8, 'type': 'follow'},
        {'src':1, 'dst': 9, 'type': 'follow'},
        {'src':2, 'dst': 1, 'type': 'connection'},
        {'src':2, 'dst': 6, 'type': 'connection'},
        {'src':2, 'dst': 5, 'type': 'connection'},
        {'src':2, 'dst': 4, 'type': 'connection'},
        {'src':2, 'dst': 3, 'type': 'connection'},
        {'src':2, 'dst': 7, 'type': 'follow'},
        {'src':2, 'dst': 8, 'type': 'member'},
        {'src':2, 'dst': 9, 'type': 'follow'},
        {'src':3, 'dst': 2, 'type': 'connection'},
        {'src':3, 'dst': 6, 'type': 'connection'},
        {'src':3, 'dst': 5, 'type': 'connection'},
        {'src':3, 'dst': 9, 'type': 'member'},
        {'src':3, 'dst': 8, 'type': 'member'},
        {'src':4, 'dst': 2, 'type': 'connection'},
        {'src':5, 'dst': 1, 'type': 'connection'},
        {'src':5, 'dst': 2, 'type': 'connection'},
        {'src':5, 'dst': 3, 'type': 'connection'},
        {'src':6, 'dst': 1, 'type': 'connection'},
        {'src':6, 'dst': 2, 'type': 'connection'},
        {'src':6, 'dst': 3, 'type': 'connection'},

Now that we have vertex and edge dataframe it is time to build the graph and make a query.

Query the Graph

Once we have vertices and edges, it will be very easy to build the graph in Graphframes:

g = GraphFrames(all_members, all_conections)

Now that we have the graph, let consider the following recommendation algorithm. We want to recommend person b to follow person a if:

1- they are not connected (neither of them is following the other one).

2- have at least 4 nodes in common. This means they both are connected to at least 4 nodes.

3- at least two of those nodes are table.

4- a is member of the above tables.

Criterias 1 and 2 can easily be done by find api from Graphframes:

one_hub_connection = g.find("(a)-[ac1]->(c1); (b)-[bc1]->(c1); (a)-[ac2]->(c2); (b)-[bc2]->(c2); (a)-[ac3]->(c3); (b)-[bc3]->(c3); (a)-[ac4]->(c4); (b)-[bc4]->(c4); !(a)-[]->(b)")

In this query, we are asking to find two nodes a and b that are connected to 4 other generic nodes c1, c2, c3, and c4. The last statement, !(a)-[]->(b), requires that a and b are not connected.

The last two pieces of requirement is a little bit complex because they require a statefull query. It means we need to look at the type of intermediate nodes and the edges and keep the count of number of member type connection to tables.

Since Graphframes is based on Dataframe, for statefull queries, we need to define a filter that uses sequence operations.

sumTable = lambda cnt, entity, connection_entity: when((entity==‘table’) & (connection_entity==‘member’), cnt+1).otherwise(cnt)

condition = reduce(lambda cnt, e: sumOrg(cnt, col(e[0]).type, col(e[1]).type), [(‘c1’,‘ac1’), (‘c2’,‘ac2’), (‘c3’,‘ac3’), (‘c4’,‘ac4’)], lit(0))

The first line defines a function that holds a counter and it increases the counter if an ci node is Table and the connection type aci and bci is member.

The second line is a sequence filter that can be used in where api.

res = one_hub_connection.where(condition >= 2 )

|             a|             b|

So answer to the query is Med and Bob. From the graph in Figure 1, it is clear that

1- They are not connect

2- They both are connected to Jerald, Alice, DeignMe, California Art, Derek

3- DesignMe and California Art are Tables

4- Med is member of California Art and DesignMe (Bob only follows them)


  1. Whoa! I’m enjoying the template/theme of this website. It’s simple, yet effective. A lot of times it’s very hard to get that “perfect balance” between superb usability and visual appeal. I must say you’ve done a very good job with this.
    AWS Training in Bangalore |Best AWS Training Institute in Bangalore BTM, Marathahalli
    AWS Training in Chennai | AWS Training Institute in Chennai Velachery, Tambaram, OMR

  2. Really you have done great job,There are may person searching about that now they will find enough resources by your post
    online Python certification course | python training in OMR | Python training course in Chennai

  3. Great Article… I love to read your articles because your writing style is too good, its is very very helpful for all of us and I never get bored while reading your article because, they are becomes a more and more interesting from the starting lines until the end.
    Online DevOps Certification Course - Gangboard
    Best Devops Training institute in Chennai

  4. After seeing your article I want to say that the presentation is very good and also a well-written article with some very good information which is very useful for the readers....thanks for sharing it and do share more posts like this.
    Java training in Chennai | Java training in Bangalore

    Java online training | Java training in Pune

  5. Wow it is really wonderful and awesome thus it is very much useful for me to understand many concepts and helped me a lot. it is really explainable very well and i got more information from your blog.

    rpa interview questions and answers
    automation anywhere interview questions and answers
    blueprism interview questions and answers
    uipath interview questions and answers
    rpa training in chennai



Favorite Quotes

"I have never thought of writing for reputation and honor. What I have in my heart must out; that is the reason why I compose." --Beethoven

"All models are wrong, but some are useful." --George Box

Copyright © 2015 • Ensemble Blogging