This section is mainly about the various optimizers and how to use them. Most machine learning tasks come down to minimizing a loss; once the loss is defined, the rest of the work is handed over to the optimizer.
Since deep learning almost always optimizes via gradients, these optimizers are essentially different refinements of the gradient descent algorithm.
For the theory, see the Stanford deep learning course. I also recommend a blog post that summarizes the principles and behavior of these optimizers quite well: An overview of gradient descent optimization algorithms.
Below I pick out a few of the most commonly used ones; for the rest, go look at the documentation yourself. Official documentation: Training
tf.train.Optimizer is the base class for optimizers. It defines the API for adding the ops that train a model. You will almost never use this class directly; instead you use one of its subclasses, such as GradientDescentOptimizer, AdagradOptimizer, or MomentumOptimizer.
For the first optimizer below I will go through some of its methods in detail; for the other classes I will only cover the constructor, since the remaining methods are largely the same across classes.
tf.train.GradientDescentOptimizer implements the plain gradient descent algorithm. (As the theory suggests, its constructor only needs a learning rate.) A usage sketch follows the method list below.
tf.train.GradientDescentOptimizer.__init__(learning_rate, use_locking=False, name='GradientDescent')
compute_gradients(loss, var_list=None, gate_gradients=GATE_OP, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)
apply_gradients(grads_and_vars, global_step=None, name=None)
get_name()
minimize(loss, global_step=None, var_list=None, gate_gradients=GATE_OP, aggregation_method=None, colocate_gradients_with_ops=False, name=None, grad_loss=None)
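To make this concrete, here is a minimal sketch (assuming TensorFlow 1.x, where the tf.train optimizers live) of the two ways to drive the optimizer: the one-shot minimize(), and the compute_gradients()/apply_gradients() pair, which is handy when you want to inspect or clip gradients in between. The toy tensors x, y, w are made up for illustration.

```python
import tensorflow as tf  # assumed: TensorFlow 1.x, where the tf.train optimizers live

# Toy least-squares problem (made up for illustration): fit w so that w * x ≈ y.
x = tf.constant([1.0, 2.0, 3.0, 4.0])
y = tf.constant([2.0, 4.0, 6.0, 8.0])
w = tf.Variable(0.0, name="w")
loss = tf.reduce_mean(tf.square(w * x - y))

opt = tf.train.GradientDescentOptimizer(learning_rate=0.05)

# Route 1: minimize() bundles compute_gradients() and apply_gradients() in one call.
train_op = opt.minimize(loss)

# Route 2: do the two steps by hand, which lets you inspect or clip the gradients.
grads_and_vars = opt.compute_gradients(loss)            # list of (gradient, variable) pairs
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v) for g, v in grads_and_vars]
train_op_clipped = opt.apply_gradients(clipped)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run([w, loss]))  # w should end up close to 2.0
```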
tf.train.AdadeltaOptimizer implements the Adadelta algorithm, which can be seen as an improved version of the Adagrad algorithm covered next.
Constructor:
tf.train.AdadeltaOptimizer.__init__(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta')
Purpose: construct a new optimizer that uses the Adadelta algorithm.
Args:
learning_rate: A Tensor or a floating point value. The learning rate.
rho: A Tensor or a floating point value. The decay rate.
epsilon: A Tensor or a floating point value. A constant epsilon used to better condition the grad update.
use_locking: If True use locks for update operations.
name: Optional name for the operations. Defaults to 'Adadelta'.
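A quick illustration (again assuming TensorFlow 1.x): the constructor arguments above map directly onto keyword arguments. The values below are tuned for the toy problem only, not recommendations; with the defaults Adadelta ramps up very slowly from a cold start, so epsilon is raised here to make progress visible within a few hundred steps.

```python
import tensorflow as tf  # assumed: TensorFlow 1.x

# Toy problem: minimize (w - 3)^2 with Adadelta.
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)

# learning_rate and epsilon are chosen for this toy example only; with the defaults
# (learning_rate=0.001, epsilon=1e-08) Adadelta barely moves in a few hundred steps.
opt = tf.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95, epsilon=1e-03)
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(500):
        sess.run(train_op)
    print(sess.run(w))  # w has moved from 0 toward 3.0
```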
Optimizer that implements the Adagrad algorithm.
See the Adagrad paper (Duchi et al., 2011).
tf.train.AdagradOptimizer.__init__(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')
Construct a new Adagrad optimizer.
Args:
learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value. Starting value for the accumulators; must be positive.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to 'Adagrad'.
Raises:
ValueError: If the initial_accumulator_value is invalid.
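A small sketch of how the constructor is used (TensorFlow 1.x assumed). The two coordinates of the toy loss get gradients of very different magnitude, which is exactly the situation Adagrad's per-parameter learning rates are designed for; the learning_rate of 1.0 is picked for the toy problem, not a general recommendation.

```python
import tensorflow as tf  # assumed: TensorFlow 1.x

# Toy quadratic with very different curvature per coordinate.
w = tf.Variable([0.0, 0.0])
target = tf.constant([3.0, 3.0])
scales = tf.constant([10.0, 0.1])   # steep direction vs. nearly flat direction
loss = tf.reduce_sum(scales * tf.square(w - target))

opt = tf.train.AdagradOptimizer(learning_rate=1.0, initial_accumulator_value=0.1)
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(500):
        sess.run(train_op)
    print(sess.run(w))  # both coordinates head toward 3.0 despite the scale gap
```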
Optimizer that implements the Momentum algorithm.
tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False)
Construct a new Momentum optimizer.
Args:
learning_rate: A Tensor or a floating point value. The learning rate.
momentum: A Tensor or a floating point value. The momentum.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to 'Momentum'.
use_nesterov: If True use Nesterov momentum.
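A minimal usage sketch under the same TensorFlow 1.x assumption; swap in use_nesterov=True to get the Nesterov variant.

```python
import tensorflow as tf  # assumed: TensorFlow 1.x

w = tf.Variable(5.0)
loss = tf.square(w)   # minimum at w = 0

# Classical momentum; pass use_nesterov=True for Nesterov momentum instead.
opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op)
    print(sess.run(w))  # w overshoots, oscillates, and settles near 0
```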
tf.train.AdamOptimizer implements the Adam algorithm.
Constructor:
tf.train.AdamOptimizer.__init__(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')
Construct a new Adam optimizer.
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
The update rule for a variable with gradient g uses an optimization described at the end of Section 2 of the Adam paper (Kingma & Ba):
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
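To make the update rule concrete, here is a plain NumPy transcription of the pseudocode above applied to a one-dimensional quadratic. This is only an illustration of the formulas, not the actual TensorFlow kernel; the function name and the toy problem are made up.

```python
import numpy as np

def adam_step(variable, g, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update, transcribing the rule above."""
    t = t + 1
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # 2nd moment estimate
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v, t = 0.0, 0.0, 0.0, 0
for _ in range(5000):
    g = 2.0 * (w - 3.0)
    w, m, v, t = adam_step(w, g, m, v, t, learning_rate=0.01)
print(w)  # approaches 3.0
```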
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
Note that in the dense implementation of this algorithm, m_t, v_t and variable are updated even if g is zero; in the sparse implementation, m_t, v_t and variable are not updated in iterations where g is zero.
Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.
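Finally, a usage sketch (TensorFlow 1.x assumed) showing how minimize() can also increment a global_step counter; the learning_rate of 0.1 is chosen for this toy problem only.

```python
import tensorflow as tf  # assumed: TensorFlow 1.x

w = tf.Variable(tf.zeros([2]))
target = tf.constant([1.0, -2.0])
loss = tf.reduce_sum(tf.square(w - target))

global_step = tf.Variable(0, trainable=False)
opt = tf.train.AdamOptimizer(learning_rate=0.1)          # other args keep their defaults
train_op = opt.minimize(loss, global_step=global_step)   # increments global_step per update

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(300):
        sess.run(train_op)
    print(sess.run([w, global_step]))  # w moves to roughly [1.0, -2.0]; global_step == 300
```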