### Bayesian MCMC in extended tecdat|r language: establishing linear regression model with rstan to analyze automobile data and visual diagnosis

Extension end assistant 2021-08-05 09:44:55 阅读数:344

bayesian mcmc extended tecdat language

This article will talk about Stan And how to R Use in rstan establish Stan Model . Even though Stan It provides documentation using its programming language and a user's guide with examples , But for beginners , This may be difficult to understand .

## Stan

Stan Is a programming language for specifying statistical models . It is most often used as a basis for Bayesian analysis MCMC Sampler . Markov chain Monte Carlo （MCMC） It's a sampling method , Allows you to estimate a probability distribution without knowing all the mathematical properties of the distribution . It is particularly useful in Bayesian inference , Because a posteriori distribution can't be written as an expression . To use Stan, The user wants to write a Stan Program , Represent their statistical model . This program specifies the parameters in the model and the target a posteriori density .Stan The code is compiled and run with the data , Output a posteriori simulation of a set of parameters .Stan And the most popular data analysis language , Such as R、Python、shell、MATLAB、Julia and Stata The interface of . We will focus on R Use in Stan.

## `rstan`

`rstan allow R The user implements the Bayesian model . You can use familiar formulas and data.frame grammar （ Such as lm()） To fit the model . By providing precompiled for common model types stan Code to implement this Simpler Syntax . It is very convenient to use , But only for specific " Commonly used " Model type . If you need to fit different model types , Then you need to use it yourself rstan code .`

The model fitting function is prefixed with stan_ Start , End with model type . The modeling function has two necessary parameters .

- The formula . A formula that specifies dependent and independent variables （y ~ x1 + x2）.
- data. A data frame containing the variables in the formula .

Besides , There is also an optional a priori parameter , It allows you to change the default a priori distribution .

stan() Function to read and compile your stan Code , And fit the model on your data set .

stan() The function has two necessary arguments .
- file . Include your Stan programmatic .stan Path to file .

- data. A named list , Provide data for the model .

## Example

As a simple example to demonstrate how to specify a model in these packages , We will use car data to fit a linear regression model . Our dependent variable is mpg, All other variables are independent variables .

``````
mtcars %>%

1.
2.
3.
``` First , We will fit the model . For linear regression , We use stan function .

````summary(fit)`

1.
```  The output shows a summary of the parameters , Including the average 、 Standard deviation and magnitude . Besides , It also shows MCMC Diagnostic statistics Rhat And effective sample size . These statistics are useful for assessing MCMC Whether the algorithm converges is very important .

Next , We will use rstan To draw up a model of the contract . Here is our model stan Code , Save in a file called stan In the file of （ You can RStudio Create a .stan file , Or use any text editor , And save with the extension .stan The file of ）.

``````
data
int<lower=0> N; // The number of observations
int<lower=0> K; // The number of predictions
matrix[N, K] X; // Prediction matrix
...
Parameters
real alpha; // intercept
...
Model
y ~ normal(alpha + X * beta, sigma); // Target density
```

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
```

Stan Code in " Block " Structured in . Every Stan Each model requires three blocks , Data 、 Parameters and models .

Data blocks are used to declare variables read in as data . In our case , We have the result vector （y） And prediction matrix （X）. When a matrix or vector is declared as a variable , You need to specify the dimensions of the object at the same time . therefore , We will also read the number of observations （N） And the number of predictors （K）.

The variables declared in the parameter block are the variables that will be Stan Sampled variables . In the case of linear regression , The parameter of interest is the intercept term （alpha） And the coefficient of the predictor （beta）. Besides , And the error term ,sigma.

The model block is where the variable probability declaration is defined . ad locum , We specify that the target variable has a normal distribution , The average value is α+X*β, The standard deviation is sigma. In this block , You can also specify a priori distribution of parameters . By default , The parameter is given a flat （ Non informative ） transcendental .

Besides , There are also some optional blocks ： function 、 Converted data 、 Parameters converted and number generated .

Next , We need to Stan The program formats our data in the desired way .stan() The function requires data to be passed in as a named list , The elements are the variables you define in the data block . For this program , We create an element for N、K、X and Y A list of .

``````list(
N = 32,
K = 10,
X = predictors,
y = mpg
)```

1.
2.
3.
4.
5.
6.
```

Now we have our code and data ready , We pass them on to the function to fit the model .

````fit_rstan`

1.
``` ` Output similar summary statistics , Include the average value of each parameter 、 Standard deviation and magnitude . These results may be similar but not identical . Why they are different , Because the statistics are calculated based on a posteriori random sampling .`

## Evaluate convergence

When using MCMC When fitting a model , It is important to check whether the chain converges . We recommend visualization to visually check MCMC Diagnostic results of . We'll create a track map ,Rhat Value diagram .

First , Let's create a trajectory diagram . The trajectory shows MCMC Sampling values of parameters during iteration . If the model has converged , Then the trajectory should look like a random scatter around the average . If the chain winds in parameter space , Or the chain converges to different values , That proves a problem . Let's demonstrate .

``````
mcmctrace()```

1.
2.
```  These trajectories show that , Both models have converged . For all parameters , All four chains are mixed , There is no obvious trend .

Next , We will check Rhat value .Rhat It is a convergence diagnosis method , It compares the parameter estimates of each chain . If the chain has converged and mixed well , that Rhat The value should be close to 1. If the chain does not converge to the same value , that Rhat Value will be greater than 1.Rhat The value is 1.05 Or higher , It shows that there is a convergence problem .rhat() The function needs a Rhat The vector of values as input , So we first extract Rhat value .

`````` rhat() +
yaxis_text()```

1.
2.
```  be-all Rhat Values are lower than 1.05, It shows that there is no convergence problem .

Stan It is a powerful tool for building Bayesian models , These bags make R Users can easily use Stan. The most popular insights