# DQN-PARL-2048-ShunS

**Repository Path**: sunNAU/DQN-PARL-2048-ShunS

## Basic Information

- **Project Name**: DQN-PARL-2048-ShunS
- **Description**: Playing the 2048 game with the DQN algorithm under the PaddlePaddle PARL framework
- **Primary Language**: Python
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 3
- **Forks**: 1
- **Created**: 2020-06-30
- **Last Updated**: 2022-07-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DQN-PARL-2048-ShunS

#### Introduction
This project plays the 2048 game with the DQN algorithm under the PaddlePaddle PARL framework.


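#### Installation

1.  Install the 2048 game environment. The environment and the result visualization in this project are based on https://github.com/FelipeMarcelino/2048-Gym/
2.  Install PaddlePaddle and PARL: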

```
pip install paddlepaddle-gpu==1.8.2.post97
pip install parl==1.3.1
```


#### Usage

1.  Train

```
python train.py
```

2.  Test and show the best result

```
python show.py
```

3.  Because of upload problems, the trained weights are not included in the repository.


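#### Implementation and Summary
1.  Existing methods fall into three main categories: temporal difference (TD) learning with N-tuple networks, supervised learning, and reinforcement learning. TD learning performs best: by reducing the dimensionality of the state space it can score above 300,000 points, but it needs many parameters. Supervised learning comes next: trained on the results produced by TD learning, it can score above 90,000 points. Reinforcement learning comes last, with scores above 30,000 points.
![Comparison of different algorithms](https://images.gitee.com/uploads/images/2020/0630/122317_ca84ae05_5440781.jpeg "Comparison of different algorithms")

TD learning is the top student of the 2048 game: it can record the results of its exploration almost completely.

Supervised learning is like a top student tutoring a talented classmate. It achieves fairly good results with fewer parameters, but because the knowledge is not self-taught, how well it extends to other settings is doubtful.

Reinforcement learning improves its score through its own exploration but never reaches high scores; the exact reason needs further study. Existing results indicate that increasing the number of filters in the convolutional layers raises the score, but it is still unclear what performance limit the algorithm could reach if its parameter count were scaled up to that of TD learning.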

Papers for reference:
> Playing game 2048 with deep convolutional neural networks trained by supervised learning

> Temporal difference learning of N-tuple networks for the game 2048

> Mastering 2048 With Delayed Temporal Coherence Learning, Multistage Weight Promotion, Redundant Encoding, and Carousel Shaping

Code for reference:
> https://github.com/tjwei/rl/ Demo: https://tjwei.github.io/2048-NN/

> https://github.com/nneonneo/2048-ai   Demo: http://ovolve.github.io/2048-AI/

> https://github.com/navjindervirdee/2048-deep-reinforcement-learning

> https://github.com/FelipeMarcelino/2048-Gym/


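2.  About DQN

For an introduction to the algorithm, see my Zhihu summary of several classic algorithms: [[Summary] 7-Day Reinforcement Learning Camp: Reinforcement Learning Practice with the PARL Framework](https://zhuanlan.zhihu.com/p/150898201)

First, I compared the algorithm's performance under different values of γ in simulation.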

|![γ=0.99](https://images.gitee.com/uploads/images/2020/0630/122936_0628cd6a_5440781.png "γ=0.99")|![γ=0.995](https://images.gitee.com/uploads/images/2020/0630/122944_500e9aa4_5440781.png "γ=0.995")|![γ=0.999](https://images.gitee.com/uploads/images/2020/0630/122950_ff7e18c5_5440781.png "γ=0.999")|
|--|--|--|
|γ=0.99|γ=0.995|γ=0.999|

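The value of γ determines how far ahead the algorithm looks: 0.99^1000 ≈ 0, 0.995^1000 ≈ 0.00665, and 0.999^1000 ≈ 0.3677. Reaching 15,000 points takes roughly 1,000 steps, so a fairly large γ is needed. The simulation uses γ=0.995 with a learning rate of 0.001 and 20,000 training episodes. Over 100 test runs, the average score was 10591.08 and the best score was 16340; the best result is shown in the figure that follows.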
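The discount weights quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: `horizon` and `discount_weight` are illustrative names that do not come from the project code, and the 1,000-step horizon is the rough estimate given in the text.

```python
# Weight that a reward arriving `steps` moves in the future receives
# under discount factor gamma: gamma ** steps.
horizon = 1000  # rough number of moves needed to reach ~15000 points (estimate from the text)


def discount_weight(gamma: float, steps: int) -> float:
    """Return gamma raised to the power of steps."""
    return gamma ** steps


# The three discount factors compared in the simulation.
weights = {gamma: discount_weight(gamma, horizon) for gamma in (0.99, 0.995, 0.999)}

for gamma, weight in weights.items():
    print(f"gamma={gamma}: gamma^{horizon} = {weight:.5f}")
```

With γ=0.99 a reward 1,000 moves away contributes almost nothing to the current value estimate, while larger γ values keep more of it.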

![Best score](https://images.gitee.com/uploads/images/2020/0630/125600_fc5ed107_5440781.jpeg "Best score")

Video demo: [video demo](https://gitee.com/devilofshine/DQN-PARL-2048-ShunS/blob/master/result/2020-06-30-1253-51.mp4)

The distribution of the maximum tile is:

| 128 | 256 | 512 | 1024 |
|-----|-----|-----|------|
| 1%  | 7% | 34% | 58%  |

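The training process is shown below:

![Training process](https://images.gitee.com/uploads/images/2020/0630/124309_72a30682_5440781.png "Training process")

During training the score rises with fluctuations. After 20,000 episodes it has not yet converged to a stable value, so further training may still improve the score; for lack of time I leave that for later.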

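3.  About the policy gradient (PG) method

The policy gradient method chooses actions by sampling from the action probabilities. If one of the initial action probabilities is close to 1, exploration becomes very difficult: actions are sampled according to their probabilities, so if one action keeps a probability of 0.9999, the state-action space cannot be explored adequately and the algorithm does not converge. In my experiments the score hovered around 100 points. Improving the algorithm requires further reading of the literature.

![Policy gradient method](https://images.gitee.com/uploads/images/2020/0630/124836_a16cf2f3_5440781.png "Policy gradient method")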
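The exploration problem described above can be illustrated with a small stand-alone sketch. It does not use the project's PG code: `count_non_greedy` and the two probability vectors are hypothetical, chosen only to show how rarely a near-deterministic policy samples anything other than its preferred move.

```python
import random


def count_non_greedy(probs, num_steps, seed=0):
    """Sample num_steps actions from probs and count those that differ
    from the most probable action."""
    rng = random.Random(seed)
    actions = list(range(len(probs)))
    greedy = max(actions, key=lambda a: probs[a])
    samples = rng.choices(actions, weights=probs, k=num_steps)
    return sum(1 for a in samples if a != greedy)


# Hypothetical policy outputs over the four 2048 moves.
collapsed = [0.9999, 0.0001 / 3, 0.0001 / 3, 0.0001 / 3]  # one move dominates
uniform = [0.25, 0.25, 0.25, 0.25]                         # no preference

steps = 1000
# Expected number of non-preferred moves for the collapsed policy: about 0.1.
expected_collapsed = steps * (1 - max(collapsed))

print("collapsed policy:", count_non_greedy(collapsed, steps), "non-preferred moves")
print("uniform policy:  ", count_non_greedy(uniform, steps), "non-preferred moves")
```

Over a whole 1,000-move game the collapsed policy is expected to try a different move about 0.1 times, so the other moves receive almost no learning signal.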

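4.  About the SAC method

Soft Actor-Critic combines the value-based approach of DQN with the policy-based approach of PG. In my implementation, however, synchronizing the Critic network weights always raises an error: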


```
    def sync_target(self, decay=None, share_vars_parallel_executor=None):
        """ self.target_model copies its parameters from self.model; if decay is not None, a soft update is performed
        """
        if decay is None:
            decay = 1.0 - self.tau
        print('`````````````````')
        print(self.model.get_actor_params())
        print('=============')
        print(self.target_model.get_actor_params())
        self.model.sync_weights_to(
            self.target_model,
            decay=decay)
```

```
KeyError: 'Unable to find the variable:PARL_target_conv2d.b_0_0. Synchronize paramsters before initialization or attr_name does not exist.'
```
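For reference, the soft update that `sync_target` is meant to perform can be sketched in plain Python. This is not PARL's implementation and does not diagnose the error. It assumes the usual convention `target = decay * target + (1 - decay) * source`, which should be checked against the PARL documentation for `sync_weights_to`; `soft_update`, the parameter dictionaries, and `tau = 0.005` are hypothetical.

```python
def soft_update(target_params, source_params, decay):
    """Blend source parameters into target parameters, name by name.

    Each parameter is a flat list of floats. decay=0.0 copies the source;
    decay close to 1.0 changes the target only slightly.
    """
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(target_params[name], source_params[name])
        ]
        for name in source_params
    }


tau = 0.005  # hypothetical soft-update rate
model_params = {"conv.w": [1.0, 2.0], "conv.b": [0.5]}
target_model_params = {"conv.w": [0.0, 0.0], "conv.b": [0.0]}

# Same relation as in sync_target above: decay = 1.0 - tau.
updated = soft_update(target_model_params, model_params, decay=1.0 - tau)
print(updated)
```

Both dictionaries must contain the same parameter names for the blend to be defined.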

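I have not figured out what causes this KeyError. If you know the reason, please contact me; I would be very grateful.

The PG and SAC code is included in the repository as well. Corrections are welcome.

That's all for now.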