Bayesian networks in R: Introduction




Sunday, February 15, 2015
Bayesian networks (BNs) are a type of probabilistic graphical model that encodes the conditional dependencies among a set of variables as a directed acyclic graph. There are benefits to using BNs compared to other machine learning techniques. A few of these benefits are:
  1. It is easy to incorporate expert knowledge into a BN model. 
  2. BN models are robust to i) noisy data, ii) missing data, and iii) sparse data. 
  3. Unlike many machine learning models (including artificial neural networks), which often behave as a “black box,” every parameter in a BN has an understandable semantic interpretation. 
This post is the first in a series on “Bayesian networks in R.” The goal is to study BNs and the algorithms available for building, training, and querying them, and to see how to use those algorithms in R.
The best-known R package for BNs is called "bnlearn". This package contains algorithms for BN structure learning, parameter learning, and inference. In this introduction, we use one of the datasets shipped with the package and show how to build a BN, train it, and make inferences from it.

First, let's load the "bnlearn" package and the "coronary" dataset.
library(bnlearn)
data(coronary)
This dataset contains the following variables:
  • Smoking (smoking): a two-level factor with levels no and yes.
  • M. Work (strenuous mental work): a two-level factor with levels no and yes.
  • P. Work (strenuous physical work): a two-level factor with levels no and yes.
  • Pressure (systolic blood pressure): a two-level factor with levels <140 and >140.
  • Proteins (ratio of beta and alpha lipoproteins): a two-level factor with levels <3 and >3.
  • Family (family anamnesis of coronary heart disease): a two-level factor with levels neg and pos.
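Before learning a structure, it is worth a quick sanity check of the data. A minimal sketch, assuming bnlearn is installed and loaded:

```r
# Load the package and data, then inspect the columns.
library(bnlearn)
data(coronary)
str(coronary)      # each column should be a two-level factor
summary(coronary)  # level counts per variable
```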

Learn structure

The first step is to learn the network structure. There are several algorithms for deriving an optimal BN structure, and a number of them are implemented in "bnlearn". For this post we use hc, a greedy, score-based hill-climbing search; bnlearn also implements the related max-min hill-climbing (MMHC) hybrid algorithm [1].
bn_df <- data.frame(coronary)
res <- hc(bn_df)
plot(res)
The hc call learns the conditional dependencies between the variables, and the plot function draws the resulting network:
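The learned arcs can also be printed as text, which is handy when no plotting device is available. A short sketch using standard bnlearn accessors:

```r
# List the directed edges of the learned structure.
arcs(res)          # two-column matrix: from, to
modelstring(res)   # the same DAG in compact "[node|parents]" notation
```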


The causality between some nodes is intuitive; however, some relations extracted from the data do not seem correct. For example, it does not make sense for Family to be conditioned on M.Work. Therefore, we need to modify the derived structure. Let’s remove the arc from M.Work to Family.
res$arcs <- res$arcs[-which((res$arcs[,'from'] == "M..Work" & res$arcs[,'to'] == "Family")),]
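bnlearn also ships a helper for this kind of manual edit; the following should be equivalent to the row deletion above:

```r
# Remove the arc from M..Work to Family using bnlearn's own helper.
res <- drop.arc(res, from = "M..Work", to = "Family")
```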

Training

After learning the structure, we need to estimate the conditional probability table (CPT) at each node. The bn.fit function does this for every node in the above graph; by default it uses maximum likelihood estimation on the data.
fittedbn <- bn.fit(res, data = bn_df)
For example, let's look at what is inside the Proteins node.
print(fittedbn$Proteins)
Proteins is conditioned on M.Work and Smoking. Since all three variables are binary (two levels each), the CPT stores one distribution over the two Proteins levels for each of the 2x2 = 4 parent configurations:
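One way to pull the raw table out is via the node's prob slot; note the dimension order depends on how bnlearn ordered the node and its parents:

```r
# The CPT is stored in the node object as a numeric array.
cpt <- fittedbn$Proteins$prob
dim(cpt)   # one dimension for Proteins plus one per parent, each of length 2
```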

Inference

Now, the BN is ready and we can start inferring from the network.
cpquery(fittedbn, event = (Proteins=="<3"), evidence = ( Smoking=="no") )
which returns roughly 0.61 (cpquery is simulation-based, so exact values vary slightly between runs). Note that although the Proteins variable is conditioned on two variables, we ran the query with evidence on only one of them. Let's make our evidence richer by asking the following: What is the chance that a non-smoker with pressure greater than 140 has a Proteins level less than 3?
cpquery(fittedbn, event = (Proteins=="<3"), evidence = ( Smoking=="no" & Pressure==">140" ) )
which returns a probability of roughly 0.63. 

We can also move against the direction of an arc between two nodes. If a person's Proteins level is less than 3, what is the chance that his or her Pressure is greater than 140?
cpquery(fittedbn, event = (Pressure==">140"), evidence = ( Proteins=="<3" ) )
The answer: Pressure is greater than 140 with probability roughly 0.41.
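Because cpquery is based on Monte Carlo sampling, repeated runs give slightly different numbers. A sketch of two common remedies: fixing the RNG seed for reproducibility, and raising the number of simulated samples n to tighten the estimate:

```r
# Reduce run-to-run variation in approximate inference.
set.seed(42)  # makes this particular run reproducible
cpquery(fittedbn, event = (Pressure == ">140"),
        evidence = (Proteins == "<3"),
        n = 10^6)  # more samples -> lower Monte Carlo error
```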

[1] Tsamardinos, Ioannis, Laura E. Brown, and Constantin F. Aliferis. "The max-min hill-climbing Bayesian network structure learning algorithm." Machine learning 65.1 (2006): 31-78.

17 comments:

  1. Nice post. One slight correction: In 2nd line of code it should be data.frame(coronary) instead of data.frame(data).

  2. Thanks Rob, a typical error when you copy and paste :)

  3. Very short and straight to the point. Thanks

    1. Thanks for the feedback. In my opinion, BNs are very powerful, and yet people sometimes don't know where to start or how to use the existing packages. The goal was to quickly give a starting point.

  4. Thanks for this post. A quick follow-up question: can you explain the intuition for why the result changes each time you run the cpquery command? Sometimes my results didn't quite match yours, but I realized that if I ran the query again, I might match them.
    tx

    1. From ?cpquery: "Note that both cpquery and cpdist are based on Monte Carlo particle filters, and therefore they may return slightly different values on different runs."

  5. Thanks for your post, which is really helpful and informative. When I tried running cpquery several times with the same evidence and event parameters, the probability results were different. Could you please tell me how to make the results the same? Thank you very much!

  6. Last paragraph: Proteins level is greater than 3
    then: evidence = ( Proteins=="<3" )

  8. Hello, does this post have a sequel? More about Bayesian networks?

    1. Hi Elias, there is a draft for the sequel but I haven't had a chance to polish and post it. Stay tuned. Meanwhile, if you have any questions or need any help, please don't hesitate to contact me directly [via LinkedIn or Gmail: mhfirooz].

  8. I was looking for a starting point for bnlearn. This was perfect. Thanks

    1. Thanks for the comment and I am glad that it was useful to you.

  9. Very helpful start to using Bayesian network.

    1. I haven't had time to finish the rest of the sequel. But if you have any specific questions, please shoot me an email.

  10. Thank you. This really helps for my research

  11. Hello! I am new to Bayesian networks; actually, I do not know whether they are useful for what I need to do. I have a set of five observable and three latent variables in my network, which are a mix of discrete and continuous variables. The observable variables are parents of the latent variables, which in turn are the parents of the target (latent) variable I need to model. I have no evidence about the latent variables at all; I need to derive them from my observed variables. Is it possible to use these networks to derive latent variables? I was thinking of using a kind of multi-criteria analysis first, but since I performed a probabilistic analysis in previous work, I was advised to keep reasoning probabilistically under uncertainty.


 


Copyright © 2015 • Ensemble Blogging