Bayesian MCMC in extended tecdat|r language: establishing linear regression model with rstan to analyze automobile data and visual diagnosis

Extension end assistant 2021-08-05 09:44:55 阅读数:344

bayesian mcmc extended tecdat language



This article will talk about Stan And how to R Use in rstan establish Stan Model . Even though Stan It provides documentation using its programming language and a user's guide with examples , But for beginners , This may be difficult to understand .


Stan Is a programming language for specifying statistical models . It is most often used as a basis for Bayesian analysis MCMC Sampler . Markov chain Monte Carlo (MCMC) It's a sampling method , Allows you to estimate a probability distribution without knowing all the mathematical properties of the distribution . It is particularly useful in Bayesian inference , Because a posteriori distribution can't be written as an expression . To use Stan, The user wants to write a Stan Program , Represent their statistical model . This program specifies the parameters in the model and the target a posteriori density .Stan The code is compiled and run with the data , Output a posteriori simulation of a set of parameters .Stan And the most popular data analysis language , Such as R、Python、shell、MATLAB、Julia and Stata The interface of . We will focus on R Use in Stan.


rstan allow R The user implements the Bayesian model . You can use familiar formulas and data.frame grammar ( Such as lm()) To fit the model . By providing precompiled for common model types stan Code to implement this Simpler Syntax . It is very convenient to use , But only for specific " Commonly used " Model type . If you need to fit different model types , Then you need to use it yourself rstan code .

The model fitting function is prefixed with stan_ Start , End with model type . The modeling function has two necessary parameters .

- The formula . A formula that specifies dependent and independent variables (y ~ x1 + x2).
- data. A data frame containing the variables in the formula .

Besides , There is also an optional a priori parameter , It allows you to change the default a priori distribution .

stan() Function to read and compile your stan Code , And fit the model on your data set .

stan() The function has two necessary arguments .
- file . Include your Stan programmatic .stan Path to file .

- data. A named list , Provide data for the model .


As a simple example to demonstrate how to specify a model in these packages , We will use car data to fit a linear regression model . Our dependent variable is mpg, All other variables are independent variables .

mtcars %>%
  • 1.
  • 2.
  • 3.



First , We will fit the model . For linear regression , We use stan function . 

  • 1.


The output shows a summary of the parameters , Including the average 、 Standard deviation and magnitude . Besides , It also shows MCMC Diagnostic statistics Rhat And effective sample size . These statistics are useful for assessing MCMC Whether the algorithm converges is very important .

Next , We will use rstan To draw up a model of the contract . Here is our model stan Code , Save in a file called stan In the file of ( You can RStudio Create a .stan file , Or use any text editor , And save with the extension .stan The file of ).

int<lower=0> N; // The number of observations
int<lower=0> K; // The number of predictions
matrix[N, K] X; // Prediction matrix
real alpha; // intercept
y ~ normal(alpha + X * beta, sigma); // Target density

  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.

Stan Code in " Block " Structured in . Every Stan Each model requires three blocks , Data 、 Parameters and models .

Data blocks are used to declare variables read in as data . In our case , We have the result vector (y) And prediction matrix (X). When a matrix or vector is declared as a variable , You need to specify the dimensions of the object at the same time . therefore , We will also read the number of observations (N) And the number of predictors (K).

The variables declared in the parameter block are the variables that will be Stan Sampled variables . In the case of linear regression , The parameter of interest is the intercept term (alpha) And the coefficient of the predictor (beta). Besides , And the error term ,sigma.

The model block is where the variable probability declaration is defined . ad locum , We specify that the target variable has a normal distribution , The average value is α+X*β, The standard deviation is sigma. In this block , You can also specify a priori distribution of parameters . By default , The parameter is given a flat ( Non informative ) transcendental .

Besides , There are also some optional blocks : function 、 Converted data 、 Parameters converted and number generated .

Next , We need to Stan The program formats our data in the desired way .stan() The function requires data to be passed in as a named list , The elements are the variables you define in the data block . For this program , We create an element for N、K、X and Y A list of .

N = 32,
K = 10,
X = predictors,
y = mpg
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.

Now we have our code and data ready , We pass them on to the function to fit the model .

  • 1.

Output similar summary statistics , Include the average value of each parameter 、 Standard deviation and magnitude . These results may be similar but not identical . Why they are different , Because the statistics are calculated based on a posteriori random sampling .

Evaluate convergence

When using MCMC When fitting a model , It is important to check whether the chain converges . We recommend visualization to visually check MCMC Diagnostic results of . We'll create a track map ,Rhat Value diagram .

First , Let's create a trajectory diagram . The trajectory shows MCMC Sampling values of parameters during iteration . If the model has converged , Then the trajectory should look like a random scatter around the average . If the chain winds in parameter space , Or the chain converges to different values , That proves a problem . Let's demonstrate . 

  • 1.
  • 2.


These trajectories show that , Both models have converged . For all parameters , All four chains are mixed , There is no obvious trend .

Next , We will check Rhat value .Rhat It is a convergence diagnosis method , It compares the parameter estimates of each chain . If the chain has converged and mixed well , that Rhat The value should be close to 1. If the chain does not converge to the same value , that Rhat Value will be greater than 1.Rhat The value is 1.05 Or higher , It shows that there is a convergence problem .rhat() The function needs a Rhat The vector of values as input , So we first extract Rhat value .

 rhat() +
  • 1.
  • 2.


be-all Rhat Values are lower than 1.05, It shows that there is no convergence problem .

Stan It is a powerful tool for building Bayesian models , These bags make R Users can easily use Stan.



The most popular insights


版权声明:本文为[Extension end assistant]所创,转载请带上原文链接,感谢。