learn the structure of a network.
learn.structure.Rd
Usage
learn.structure(
bn,
dataset,
algo = "mmhc",
scoring.func = "BDeu",
initial.network = NULL,
alpha = 0.05,
ess = 1,
bootstrap = FALSE,
layering = c(),
max.fanin = num.variables(dataset),
max.fanin.layers = NULL,
max.parents = num.variables(dataset),
max.parents.layers = NULL,
layer.struct = NULL,
cont.nodes = c(),
use.imputed.data = FALSE,
use.cpc = TRUE,
mandatory.edges = NULL,
...
)
# S4 method for BN,BNDataset
learn.structure(
bn,
dataset,
algo = "mmhc",
scoring.func = "BDeu",
initial.network = NULL,
alpha = 0.05,
ess = 1,
bootstrap = FALSE,
layering = c(),
max.fanin = num.variables(dataset) - 1,
max.fanin.layers = NULL,
max.parents = num.variables(dataset) - 1,
max.parents.layers = NULL,
layer.struct = NULL,
cont.nodes = c(),
use.imputed.data = FALSE,
use.cpc = TRUE,
mandatory.edges = NULL,
...
)
Arguments
- bn
a BN object.
- dataset
a BNDataset.
- algo
the algorithm to use. Currently, one among sm (Silander-Myllymaki), mmpc (Max-Min Parents-and-Children), mmhc (Max-Min Hill Climbing, default), hc (Hill Climbing) and sem (Structural Expectation-Maximization).
- scoring.func
the scoring function to use. Currently, one among BDeu, AIC, BIC.
- initial.network
network structure to be used as the starting point for structure search. Can take different values: a BN object, a matrix containing the adjacency matrix of the structure of the network, or the string random.chain to sample a random chain as starting point.
- alpha
confidence threshold (only for mmhc).
- ess
Equivalent Sample Size value.
- bootstrap
TRUE to use bootstrap samples.
- layering
vector containing the layer each node belongs to (only for sm).
- max.fanin
maximum number of parents for each node (only for hc, mmhc).
- max.fanin.layers
matrix of available parents in each layer (only for sm -- DEPRECATED, use max.parents.layers instead).
- max.parents
maximum number of parents for each node (for sm, hc, mmhc).
- max.parents.layers
matrix of available parents in each layer (only for sm).
- layer.struct
0/1 matrix indicating which layers can contain parent nodes for nodes in a layer (only for mmhc, mmpc).
- cont.nodes
vector containing the indices of continuous variables.
- use.imputed.data
TRUE to learn the structure from the imputed dataset (if available, a check is performed). Default is to use the raw dataset.
- use.cpc
(when using mmhc) compute Candidate Parent-and-Children sets instead of starting the Hill Climbing from an empty graph.
- mandatory.edges
binary matrix, where a 1 in cell [i,j] indicates that an edge from node i to node j must be present in the final network.
- ...
potential further arguments for the method.
Value
new BN
object with DAG.
Details
We provide five algorithms to learn the structure of the network, which can be chosen with the algo parameter.
The first is the Silander-Myllymäki (sm) exact search-and-score algorithm, which performs a complete evaluation of the search space in order to discover the best network; this algorithm may take a very long time, and can be inapplicable when discovering networks with more than 25-30 nodes. Even for small networks, users are strongly encouraged to provide meaningful parameters such as the layering of the nodes, or the maximum number of parents -- refer to the documentation of the method parameters in the package manual for more details.
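As a hedged sketch, an exact search could be configured as follows; the file names follow the Examples section, and the layer assignment and parent limit are purely illustrative values for a 5-node network:
dataset <- BNDataset("file.header", "file.data")  # hypothetical files, as in the Examples
net <- BN(dataset)
layers <- c(1, 2, 2, 3, 3)  # illustrative layer assignment, one entry per node
net <- learn.structure(net, dataset, algo = "sm",
                       layering = layers, max.parents = 3)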
The second method is the constraint-based Max-Min Parents-and-Children (mmpc), which returns the skeleton of the network. Given the possible presence of loops, due to the non-directionality of the discovered edges, no parameter learning is possible using this algorithm. Also note that, in the case of a very dense network and many observations, the statistical evaluation of the search space may take a long time. This algorithm, too, has parameters that may need to be tuned, mainly the confidence threshold of the statistical pruning; please refer to the rest of this documentation for their explanation.
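Continuing the sketch above, a skeleton search with a stricter confidence threshold might look as follows (the wpdag() accessor for inspecting the result is an assumption, see Value):
net <- learn.structure(net, dataset, algo = "mmpc", alpha = 0.01)
wpdag(net)  # the undirected result is stored in the wpdag field, not in dag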
The third algorithm is another heuristic, the Hill Climbing (hc). It can start from the complete space of possibilities (default) or from a reduced subset of candidate edges, controlled with the use.cpc argument.
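A minimal sketch of a greedy search seeded with a random chain instead of the empty graph (values illustrative):
net <- learn.structure(net, dataset, algo = "hc",
                       initial.network = "random.chain",
                       scoring.func = "BIC")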
The fourth algorithm (and the default one) is the Max-Min Hill Climbing heuristic (mmhc), which performs a statistical sieving of the search space followed by a greedy evaluation, combining the MMPC and HC algorithms. It is considerably faster than the complete method, at the cost of a (likely) lower-quality result. As for MMPC, the computational time depends on the density of the network, the number of observations, and the tuning of the parameters.
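As an illustration, an MMHC run that forces one edge through mandatory.edges (the node indices are purely illustrative):
n <- num.variables(dataset)
me <- matrix(0, n, n)
me[1, 2] <- 1  # require an edge from node 1 to node 2
net <- learn.structure(net, dataset, algo = "mmhc",
                       alpha = 0.05, ess = 1, mandatory.edges = me)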
The fifth method is the Structural Expectation-Maximization (sem) algorithm, for learning a network from a dataset with missing values. It iterates a sequence of Expectation-Maximization (in order to "fill in" the holes in the dataset) and structure learning from the guessed dataset, until convergence. For computational reasons, the structure learning step inside SEM uses MMHC. Convergence of SEM can be controlled with the parameters struct.threshold and param.threshold, for structure and parameter convergence, respectively.
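A hedged sketch of a SEM run on a dataset containing missing values; the threshold values below are illustrative and are passed through the ... argument:
net <- learn.structure(net, dataset, algo = "sem",
                       scoring.func = "BDeu", ess = 1,
                       struct.threshold = 10, param.threshold = 0.001)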
Search-and-score methods also need a scoring function to compute an estimated measure of each configuration of nodes. We provide three of the most popular scoring functions: BDeu (Bayesian-Dirichlet equivalent uniform, default), AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). The scoring function can be chosen with the scoring.func parameter.
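For example (illustrative values), the same greedy search can be scored with different functions:
net.bic  <- learn.structure(net, dataset, algo = "hc", scoring.func = "BIC")
net.bdeu <- learn.structure(net, dataset, algo = "hc",
                            scoring.func = "BDeu", ess = 10)  # BDeu also uses the ESS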
Structure learning sets the dag field of the BN under study, unless bootstrap or the mmpc algorithm are employed. In these cases, given the possible presence of loops, the wpdag field is set instead.
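Assuming the dag() and wpdag() accessors are available for BN objects, the difference can be inspected as follows (a sketch):
net <- learn.structure(net, dataset, algo = "mmhc")
dag(net)    # adjacency matrix of the learned DAG
net <- learn.structure(net, dataset, algo = "mmpc")
wpdag(net)  # skeleton stored in the wpdag field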
In case of missing data, the default behaviour (with no other indication from the user)
is to learn the structure using mmhc
starting from the raw dataset.
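To learn from imputed data instead, the dataset can be imputed first and use.imputed.data set to TRUE (a sketch, assuming bnstruct's impute() method for BNDataset objects):
dataset <- impute(dataset)
net <- learn.structure(net, dataset, algo = "mmhc", use.imputed.data = TRUE)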
Examples
if (FALSE) {
dataset <- BNDataset("file.header", "file.data")
bn <- BN(dataset)
# use MMHC
bn <- learn.structure(bn, dataset, alpha=0.05, ess=1, bootstrap=FALSE)
# now use Silander-Myllymaki
layers <- layering(bn)
mfl <- as.matrix(read.table(header=FALSE, text='
0 1 1 1 1
0 1 1 1 1
0 0 8 7 7
0 0 0 14 6
0 0 0 0 19'))
bn <- learn.structure(bn, dataset, algo='sm', max.fanin=3, cont.nodes=c(),
layering=layers, max.fanin.layers=mfl, use.imputed.data=FALSE)
}