Working with Bipartite graph data in pandas
Quick Summary: What is a bipartitie graph?
In a bipartite graph there are two types of nodes - for example people, and parties. The connections between nodes ( people or parties ) in this graph occur when a person attends a party, or a party is attended by a person. The connection between two people in this type of graph is indirect-- through which parties they mutually attend. This is a classic example from network analysis called the southern women dataset.
Other examples of bipartite network data:
- people - linked to organizations where they serve on the board of directors
- people - linked to news articles where they are mentioned
- people - linked to items that they put in their grocery shopping cart
- twitter users - linked to organizations that they follow
With this type of network, we learn about the nodes through their affiliation with nodes of the other type.
Working with Bipartite graph data in pandas
The Edgelist
The most common form for representing graph data is called an 'edge list', so we will start with one of those in pandas. At a minimum, two columns are needed for defining a basic non-directional graph. Each row representats a relationship between one node and another.
import pandas as pd import numpy as np np.random.seed(50) graph = pd.DataFrame({'person' : np.random.choice(range(10), 30) , 'party': np.random.choice(['A','B','C','D', 'E'], 30, [.1,.4,.2,.2,.1] ) }) graph.drop_duplicates(inplace=True) # since we're generating the data randomly , remove duplicates graph.head() party person 0 D 9 1 C 1 2 A 6 3 B 3 4 A 9
In our example parties are represented by the letters A through D, and people are represented by numbers 1 through 10. Each row of the dataframe indicates that a person attended a party.
Matrix Representation of the Graph
An equivalent representation of the edgelist is as a matrix. In the Matrix representation, one type of node will be on the columns, while the other will be on the rows. At the intersection of each row and column you'll get a 1 if there's a link between the two nodes, and a 0 if there is not. This contains the same information as the edge-list.
gmatrix = pd.crosstab(graph.party, graph.person) gmatrix person 0 1 2 3 4 5 6 7 9 party A 1 0 1 1 1 1 1 0 0 B 1 0 1 1 0 0 0 0 0 C 1 0 0 0 1 0 1 0 0 D 0 1 0 1 0 1 1 1 1 E 0 0 1 1 1 1 0 0 1
Transformating to a single-mode graph
With some simple matrix algebra we can start to answer questions about how each type of node is linked to other nodes of the same type. Rather than seeing how people are affiliated with parties, we can do a simple calculation to determine for each party how many attendees they have in common with each other party.
The matrix multiplcation is: M*T(M) ( The matrix dotted with the transform of itself) In pandas this is easy:
::: python parties = (gmatrix).dot(gmatrix.T) parties party A B C D E party A 6 3 3 3 4 B 3 3 1 1 2 C 3 1 3 1 1 D 3 1 1 6 3 E 4 2 1 3 5 # The matrix contains some redundant information # only need the lower half since direction doesn't matter in this graph # Note that the diagonal eg: (A, A) represents the total number of people at party A ( 6 ) colnames = parties.columns parties = pd.DataFrame(np.tril(parties.values, 0)) parties.index = colnames parties.columns = colnames party A B C D E party A 6 0 0 0 0 B 3 3 0 0 0 C 3 1 3 0 0 D 3 1 1 6 0 E 4 2 1 3 5
We can similarly focus on the people, answering: for each person, how many parties did they attend in common?
people=(gmatrix.T).dot(gmatrix) colnames = people.columns people = pd.DataFrame(np.tril(people.values, 0)) people.index = colnames people.columns = colnames person 0 1 2 3 4 5 6 7 9 person 0 3 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 2 2 0 3 0 0 0 0 0 0 3 2 1 3 4 0 0 0 0 0 4 2 0 2 2 3 0 0 0 0 5 1 1 2 3 2 3 0 0 0 6 2 1 1 2 2 2 3 0 0 7 0 1 0 1 0 1 1 1 0 9 0 1 1 2 1 2 1 1 2
Melt back into an edgelist
Now we can melt each of these matricies back into an edge-list, and answer questions about the different types of nodes. The columns source and target define the link between two nodes, and finally 'weight' identifies the strength of the connection for each type of node: * for people, it's the number of instances when both people were present together at the same party * for parties , it's the number of people in common who attended both parties.
def matrix_to_edgelist(matrix_df): el = matrix_df.unstack() el.index.names = ['source','target'] el = el.reset_index() el.rename(columns={0:'weight'}, inplace=True) return el[el.weight != 0].reset_index() el_people = matrix_to_edgelist(people) el_people source target weight 2 0 2 2 3 0 3 2 4 0 4 2 5 0 5 1 6 0 6 2 12 1 3 1 14 1 5 1 15 1 6 1 16 1 7 1 el_parties= matrix_to_edgelist(parties).head() el_parties source target weight 0 A A 6 1 A B 3 2 A C 3 3 A D 3 4 A E 4
Conclusion
It's well-known that pandas is a powerful library for working with tabular data. Networks, especially large networks, can also be managed and manipulated using the library in very clean ways. In this post I demonstrated how a few of the features of pandas' can be applied when manipulating a network dataset :
- pd.crosstab(rows, columns) <- used in this case for transforming edgelist into matrix-representation
- df.dot(df.T) <- the pandas representation of the matrix dotted with its own transpose
- np.tril(df) <- returns the lower triangular matrix of the passed-in dataframe
- matrix_df.unstack() <- transforms the matrix representation into an edge-list series.