Start reading!
(Another book recommended by Zhu Jun: Pattern Recognition and Machine Learning; a well-known paper: Pay Attention to MLPs)
Part I: Applied Math and Machine Learning Basics
Linear models cannot learn the XOR function.
Probability
Multivariate Gaussian distribution: $\mathcal{N}(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \sqrt{\dfrac{1}{(2\pi)^n \det(\boldsymbol{\Sigma})}} \exp\!\left(-\dfrac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu})\right)$, where $\boldsymbol{\Sigma}$ is a positive-definite real symmetric matrix giving the covariance of the $n$ variables.
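A minimal NumPy sketch of the density above, mainly to sanity-check the formula numerically (the mean, covariance, and test point are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(x; mu, Sigma) for an n-dimensional x, matching the formula above."""
    n = x.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve instead of an explicit inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])      # positive-definite, symmetric
print(gaussian_pdf(np.array([0.3, -0.1]), mu, Sigma))
```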
Gaussian mixture model: $P(\boldsymbol{x}) = \sum_i P(c=i)\, P(\boldsymbol{x} \mid c=i)$
- Each component $P(\boldsymbol{x} \mid c=i)$ is a Gaussian, where $c$ is a latent variable (see the sampling sketch below).
- Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint $\boldsymbol{\Sigma}^{(i)} = \boldsymbol{\Sigma}$ for all $i$.
- $P(c=i)$: prior probability
- $P(c \mid \boldsymbol{x})$: posterior probability
- Any smooth distribution can be approximated to arbitrary precision by a mixture of enough Gaussian components.
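A small sketch of ancestral sampling from a mixture, assuming a toy 1-D mixture whose weights, means, and standard deviations are made up: draw the latent component $c$ from the prior first, then draw $x$ from that component's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D mixture: weights play the role of the prior P(c = i).
weights = np.array([0.3, 0.7])          # P(c = i)
means   = np.array([-2.0, 3.0])
stds    = np.array([0.5, 1.0])

def sample_gmm(n):
    c = rng.choice(len(weights), size=n, p=weights)   # draw the latent component first
    return rng.normal(means[c], stds[c])              # then draw x | c from that Gaussian

samples = sample_gmm(10_000)
print(samples.mean(), samples.std())
```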
logit:
- "log-it", the log-odds
- $\operatorname{logit}(p) = \log \dfrac{p}{1-p}$, the log of the odds, where $\text{odds} = \dfrac{p}{1-p}$
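As a quick check (not in the note itself, but standard): the logit is the inverse of the logistic sigmoid $\sigma$.

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\operatorname{logit}(\sigma(x))
  = \log\frac{\sigma(x)}{1 - \sigma(x)}
  = \log\frac{1/(1+e^{-x})}{e^{-x}/(1+e^{-x})}
  = \log e^{x} = x .
$$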
Measure theory:
- mathematical analysis
- Banach-Tarski theorem: assuming the axiom of choice, a solid 3-D ball can be decomposed into finitely many (non-measurable) pieces which, using only rotations and translations, can be reassembled into two complete balls, each with the same radius as the original.
Jacobian matrix:
- mathematical analysis
- describes a nonlinear coordinate transformation (locally, through its best linear approximation)
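For reference (the standard definition, not taken from the note itself): for $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of partial derivatives.

$$
\boldsymbol{J} \in \mathbb{R}^{m \times n}, \qquad
J_{i,j} = \frac{\partial f_i}{\partial x_j}.
$$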
Information Theory
Self-information: $I(x) = -\log P(x)$
- unit: nat (with the natural logarithm)
Shannon entropy: $H(x) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)]$
Kullback-Leibler (KL) divergence: $D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \dfrac{P(x)}{Q(x)}\right]$
- not symmetric
Cross-entropy: $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q) = -\mathbb{E}_{x \sim P}[\log Q(x)]$
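A short NumPy sketch of the three quantities above for discrete distributions (the distributions p and q are made up); it checks the asymmetry of KL and the identity $H(P,Q) = H(P) + D_{\mathrm{KL}}(P\|Q)$:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = -E_{x~P}[log P(x)], in nats."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def kl(p, q):
    """D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]; not symmetric in p and q."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * (np.log(p) - np.log(q)))

def cross_entropy(p, q):
    """H(P, Q) = H(P) + D_KL(P || Q) = -E_{x~P}[log Q(x)]."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))                                       # asymmetry
print(np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q)))   # identity holds
```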
Structured probabilistic model:
- directed: $p(\boldsymbol{x}) = \prod_i p\big(x_i \mid \mathrm{Pa}_{\mathcal{G}}(x_i)\big)$
- undirected: $p(\boldsymbol{x}) = \dfrac{1}{Z} \prod_i \phi^{(i)}\big(\mathcal{C}^{(i)}\big)$
Numerical Computation
rounding error:
- accumulates over many operations
- underflow: numbers near zero are rounded to 0
- overflow: numbers of large magnitude are approximated as $\infty$ or $-\infty$
- where it occurs: the exp inside softmax (see the stable-softmax sketch below)
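The standard fix is to subtract the maximum before exponentiating, which leaves the softmax value unchanged but prevents exp from overflowing (a minimal sketch):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) does not change the result,
    but keeps exp() from overflowing and makes the largest exponent exp(0) = 1."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / np.sum(e)

print(softmax([1000.0, 1001.0, 1002.0]))   # a naive exp(1000) would overflow to inf
```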
We expand $f$ in a second-order Taylor series around $\boldsymbol{x}^{(0)}$:
$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)}) + (\boldsymbol{x}-\boldsymbol{x}^{(0)})^\top \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)}) + \frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}^{(0)})^\top \boldsymbol{H}(f)(\boldsymbol{x}^{(0)})\,(\boldsymbol{x}-\boldsymbol{x}^{(0)})$
Since the Hessian matrix is fixed, we can solve for the critical point of this quadratic; that is our estimate of the extremum:
$\boldsymbol{x}^{\star} = \boldsymbol{x}^{(0)} - \boldsymbol{H}(f)(\boldsymbol{x}^{(0)})^{-1} \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)})$
Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms.
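A toy sketch contrasting the two on a quadratic (the matrix, vector, step size, and iteration count are made up): one Newton step lands exactly on the minimizer, while gradient descent needs many small steps.

```python
import numpy as np

# Quadratic test function f(x) = 0.5 x^T A x - b^T x (toy example).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # positive-definite Hessian
b = np.array([1.0, -2.0])

grad = lambda x: A @ x - b          # first-order information
hess = lambda x: A                  # second-order information (constant here)

x0 = np.array([5.0, 5.0])

# One Newton step: x* = x0 - H^{-1} grad f(x0); exact for a quadratic.
x_newton = x0 - np.linalg.solve(hess(x0), grad(x0))

# Gradient descent takes many small steps toward the same point.
x_gd = x0.copy()
for _ in range(200):
    x_gd = x_gd - 0.1 * grad(x_gd)

print(x_newton, x_gd, np.linalg.solve(A, b))   # all should (approximately) agree
```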
Constrained Optimization:
- project each gradient-descent step back onto the feasible region (see the sketch after this list)
- change of variables, so the constraints are satisfied automatically
- method of Lagrange multipliers (constrained extrema)
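A minimal sketch of the first approach, projected gradient descent, on a made-up problem: minimize a squared distance subject to staying inside the unit ball.

```python
import numpy as np

# Minimize f(x) = ||x - target||^2 subject to ||x|| <= 1 (toy constraint set).
target = np.array([2.0, 1.0])
grad = lambda x: 2 * (x - target)

def project_to_unit_ball(x):
    """Map a point back onto the feasible set {x : ||x|| <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

x = np.zeros(2)
for _ in range(100):
    x = x - 0.05 * grad(x)          # ordinary gradient-descent step
    x = project_to_unit_ball(x)     # then project back into the constraint region

print(x, np.linalg.norm(x))         # ends up at target/||target||, on the boundary
```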
Machine Learning Basics
Tasks:
- Classification
- Classification with missing inputs: the input variables may not all be observable at once; with $n$ inputs there are $2^n$ different sets of available information.
- Regression: insurance premiums
- Transcription: image (OCR), speech
- Machine translation
- Structured output: parsing, pixel-wise segmentation, image captioning
- Anomaly detection
- Synthesis and sampling: generate new examples; speech synthesis: there is no single correct output
- Imputation of missing values
- Denoising
- Density estimation
Performance measure
Experience: the boundaries between the categories below are blurred
- supervised
- unsupervised:
- learn probability distribution
- synthesis
- denoising
- clustering
- semi-supervised
Capacity:
- choose the hypothesis space: linear, quadratic, or higher-order (see the polynomial-fitting sketch below)
- Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with.
- Occam's razor
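A quick illustration of capacity with polynomial fitting (the toy data, degrees, and noise level are made up; exact errors depend on the random seed): degree 1 underfits, degree 2 matches the data-generating process, degree 9 drives training error toward zero but tends to have the largest test error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a noisy quadratic.
x = np.linspace(-1, 1, 10)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.3, size=x.shape)

x_test = np.linspace(-1, 1, 100)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2

for degree in (1, 2, 9):                       # under-, well-, and over-parameterized
    coeffs = np.polyfit(x, y, degree)          # least-squares fit in this hypothesis space
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
```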
nonparametric models:
- nearest neighbor regression
- wrap a parametric learning algorithm inside another algorithm that increases the number of parameters as needed
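A minimal 1-nearest-neighbor regression sketch (toy data), showing that the "parameters" are just the stored training examples:

```python
import numpy as np

def nearest_neighbor_regress(x_train, y_train, x_query):
    """1-nearest-neighbor regression: predict the y of the closest training point.
    There is no fixed parameter vector; the whole training set is the model."""
    i = np.argmin(np.abs(x_train - x_query))
    return y_train[i]

x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.1, 0.9, 4.2, 8.8])
print(nearest_neighbor_regress(x_train, y_train, 1.8))   # -> 4.2 (closest to x = 2.0)
```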
Bayes error: the error incurred by an oracle making predictions from the true distribution $p(\boldsymbol{x}, y)$
The no free lunch theorem: averaged over all possible data-generating distributions, every classification algorithm has the same error rate on previously unobserved points.
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
k-fold cross-validation: One problem is that no unbiased estimators of the variance of such average error estimators exist (Bengio and Grandvalet, 2004), but approximations are typically used.
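A small sketch of the k-fold procedure itself (the learner here is a deliberately trivial "predict the training mean" placeholder, purely for illustration):

```python
import numpy as np

def k_fold_indices(n_examples, k, seed=0):
    """Split example indices into k (nearly) equal folds."""
    idx = np.random.default_rng(seed).permutation(n_examples)
    return np.array_split(idx, k)

def cross_validate(x, y, k, fit, predict):
    """Average held-out squared error over k folds; fit/predict can be any learner."""
    folds = k_fold_indices(len(x), k)
    errors = []
    for i, fold in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(x[train], y[train])
        errors.append(np.mean((predict(model, x[fold]) - y[fold]) ** 2))
    return np.mean(errors)

x = np.linspace(0, 1, 20)
y = 2 * x + 0.1
print(cross_validate(x, y, k=5,
                     fit=lambda xs, ys: ys.mean(),
                     predict=lambda m, xs: np.full(len(xs), m)))
```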