Changelog

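OneFlow has released version 0.3.2. This version and the earlier 0.3.1 are both minor releases of 0.3.0, so they are covered together here.
These releases bring many performance optimizations, add a number of new features, and add early support for CUDA 11.1.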

Highlights of new features

Sublinear memory optimization
Enable it by wrapping the relevant code in a flow.experimental.scope.config(checkpointing=True) scope. This greatly reduces memory usage. For example:

def transformer_layer(self, name, x, *, past):
    # ...
    with flow.scope.namespace(name):
        x = flow.identity(x)
        with flow.experimental.scope.config(
            checkpointing=self.checkpoint_activations
        ):
            norm1 = norm(x, name="layernorm_1")
            # ...
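
To give a rough sense of why checkpointing makes memory use sublinear, the sketch below counts how many activations are held at once when a chain of layers keeps only one activation per segment and recomputes the rest, one segment at a time, during the backward pass. It is a simplified counting model under those assumptions, not OneFlow's implementation; the function name `peak_activations` and the uniform segment size are illustrative.

```python
import math


def peak_activations(num_layers: int, segment: int) -> int:
    """Count activations held at once under uniform checkpointing.

    Assumption (illustrative model): one checkpoint is kept per segment of
    `segment` layers, and during backward a single segment at a time is
    recomputed and held in full.
    """
    num_checkpoints = math.ceil(num_layers / segment)
    # Checkpoints stay resident; one segment's activations are live at a time.
    return num_checkpoints + min(segment, num_layers)


n = 64
no_checkpointing = n  # every activation is kept for the backward pass
best_segment = min(range(1, n + 1), key=lambda k: peak_activations(n, k))
print(no_checkpointing, best_segment, peak_activations(n, best_segment))
```

In this model the best segment size is close to the square root of the number of layers, which is where the "sublinear" label comes from. The price is roughly one extra forward computation per segment.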

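New checkpoint
The new checkpoint is considerably more flexible. It supports partial loading and saving, reading the values of weights (for printing, for example), and assigning weights from numpy arrays.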

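import tempfile
import oneflow as flow

# Save all variables, then restore them in a fresh session.
# refresh_session and get_checkpoint_ready_model are helpers defined elsewhere (not shown).
with tempfile.TemporaryDirectory() as save_dir:
    refresh_session()
    large1 = get_checkpoint_ready_model(model_getter, dtype)
    flow.checkpoint.save(save_dir)
    res1 = large1()
    refresh_session()
    large2 = get_checkpoint_ready_model(model_getter, dtype)
    # Read the saved variables back from disk, then load them into the new model.
    vars_in_file = flow.checkpoint.get(save_dir)
    flow.load_variables(vars_in_file)
    res2 = large2()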

import numpy as np

# Assign a weight from a numpy array: only "x" is overwritten, "y" is left as is.
refresh_session()
model = get_checkpoint_ready_model(get_add_and_reduce_mean_model, dtype)
var_x = flow.get_all_variables()["x"]
var_y_value_before_loading = flow.get_all_variables()["y"].numpy()
new_val_np = np.random.random(var_x.shape).astype(np.float32)
flow.load_variables({"x": new_val_np})
var_y_value_after_loading = flow.get_all_variables()["y"].numpy()
flow_res = model()
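
The partial-loading behaviour shown above can be pictured with a small stand-alone model. The sketch below uses a plain dict of numpy arrays as a hypothetical stand-in for a model's variables; `partial_load` is an illustrative helper, not a OneFlow function, and the shape check and dtype cast are assumptions about what a loader would reasonably do.

```python
import numpy as np


def partial_load(store, new_values):
    """Overwrite only the variables named in `new_values`.

    `store` is a plain dict standing in for a model's variables
    (hypothetical; a real framework keeps them on devices).
    """
    for name, value in new_values.items():
        if name not in store:
            raise KeyError(f"unknown variable: {name}")
        value = np.asarray(value)
        if value.shape != store[name].shape:
            raise ValueError(f"shape mismatch for {name}")
        # Keep the variable's own dtype (an assumption made for this sketch).
        store[name] = value.astype(store[name].dtype)


store = {
    "x": np.zeros((2, 3), dtype=np.float32),
    "y": np.ones((3,), dtype=np.float32),
}
y_before = store["y"].copy()
new_x = np.arange(6, dtype=np.float64).reshape(2, 3)
partial_load(store, {"x": new_x})

print(np.array_equal(store["y"], y_before))  # compares "y" before and after
print(store["x"].dtype)
```

Only the variables named in the mapping change, which is the property the excerpt above checks by reading "y" before and after loading "x".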

Dynamic loss scale schedule
To enable it:

loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
optimizer = flow.optimizer.AdamW(..., loss_scale_policy=loss_scale_policy)
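
As background on what a dynamic loss scale policy does in general, here is a minimal pure-Python sketch. It is not OneFlow's implementation: the class name, the initial scale, and the multiplier of 2.0 are assumptions chosen for illustration. Only `increment_period` corresponds to a parameter that appears in the snippet above.

```python
import math


class DynamicLossScale:
    """Toy dynamic loss-scale policy (illustrative, not OneFlow's code).

    Assumptions: the scale is multiplied by `multiplier` after
    `increment_period` consecutive steps with finite gradients, and divided
    by `multiplier` whenever a non-finite gradient is seen.
    """

    def __init__(self, initial_scale=2.0 ** 16, increment_period=2000, multiplier=2.0):
        self.scale = initial_scale
        self.increment_period = increment_period
        self.multiplier = multiplier
        self.good_steps = 0

    def update(self, grads):
        """Return True if the step should be applied, False if skipped."""
        if all(math.isfinite(g) for g in grads):
            self.good_steps += 1
            if self.good_steps >= self.increment_period:
                self.scale *= self.multiplier
                self.good_steps = 0
            return True
        # Overflow: shrink the scale and skip this update.
        self.scale /= self.multiplier
        self.good_steps = 0
        return False


policy = DynamicLossScale(initial_scale=1024.0, increment_period=3)
history = []
for grads in [[0.1], [0.2], [0.3], [float("inf")], [0.1]]:
    applied = policy.update(grads)
    history.append((applied, policy.scale))
print(history)
```

The changelog entries "multi_count_not_finite op" and "model update op add skip if" suggest that OneFlow also detects non-finite gradients and skips the update in that case, but the exact rules are not stated here.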

Support for the latest CUDA 11.1

Install with the following command:

python3 -m pip install --find-links https://release.oneflow.info oneflow_cu111 --user

Prebuilt packages with the XLA tensor compiler are available (for CUDA 10, 10.1, 10.2, and 11.0)

Install with the following command:
```
python3 -m pip install --find-links https://release.oneflow.info oneflow_cu101_xla --user
```

Major improvements and bug fixes

Changelog v0.3.0 ~ v0.3.2 (16/12/2020)

Op fixes and optimizations

Optimized Ops and Op combinations such as scalar mul by tensor, cast scale, prelu, and fused_scale_tril.

[enhancement][op] Dev sx xla clip #3656
[enhancement][op] Add UserOp::InferSbpSignature #3699
[bug][op] Fix fuse scalar mul by tensor sbp #3692
[bug][op] fix softmax condition #3675
[enhancement][op] slice_update op #3544
[enhancement][op] optimize rmsprop and lars optimizers #3809
[enhancement][op] add oneflow_range #3725
[enhancement][op] torch.gather #3602
[bug][op] skip conv2d padding dynamic test case #3813
[bug][op] Fix __hne in BinaryFuncFloorMod #3788
[bug][op] Fix bn[_add]_relu test case #3767
[enhancement][op][system] Make class Tensor abstract #3757
[enhancement][op] Add user_op::KernelCreateContext #3739
[bug][op] fix warning #3732
[api][enhancement][op] User op registry attr #3716
[enhancement][op][refactor] Dev refactor user op registry attr #3714
[bug][op] fix argwhere format #4010
[enhancement][op] Argwhere support empty blob #4009
[enhancement][op] Fuse cast scale #3999
[enhancement][op] layer_norm_grad_add_to_output #3998
[enhancement][op] Dev optimize prelu #3987
[api][enhancement][op] Switch identity to user op and add it to auto mixed precision clear list #3992
[enhancement][op] Optimize slice kernel #3989
[bug][op] Hotfix: add parallel cast to amp clear list #3988
[enhancement][op] fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
[bug][op] add combined margin cpu and fix bug #3961
[bug][op] fix pad op #3971
[bug][op] Fix constant init value #3947
[bug][op] indexed_slices_model_update handle empty tensor #3933
[bug][op] fix distribute_clone sbp #3803
[bug][op] Reshape backward issue with distribute split #3915
[enhancement][op] Remove NormalModelUpdateOpConf #3917
[enhancement][op] Dev unsorted segment sum #3731
[bug][op] Dev split like add backward #3901
[bug][op] distribute concat out dynamic false #3899
[enhancement][op] UserOpWrapper add HasGradTensor4OpOutput #3904
[enhancement][op] Unpack/Pack user op #3727
[enhancement][op] adam_bias_correction_learning_rate #3763
[enhancement][op][serving] add flatten op implementation #3789
[enhancement][op] Dev enhance sort ops #3828
[enhancement][op] Optimize softmax cuda kernel block size #3853
[enhancement][op] SplitLikeOp prefix support #3866
[bug][op] fix gather set_is_dynamic #3900
[bug][op] fix unsorted segment sum like #3898

New Ops and new features for existing Ops

Added Ops including polyval, swish, mish, multi_square_sum, mseloss, lamb, and triplet loss.

[enhancement][op] Add polyval op #3541
[feature][op] Add broadcast like backward #3665
[feature][op] Add cuda_pseudo_half.h #3669
[feature][op][python] add swish activation #3970
[feature][op][python] add mish activation #3972
[feature][op] Add multi_square_sum op #3977
[feature][op] TripOp add fill value #3960
[feature][op] add combined margin loss #3819
[feature][op] dynamic loss scale schedule op #3885
[feature][op][python] add mseloss #3893
[feature][op] LAMB support #3620
[feature][op] logical slice_assign and slice op #3647
[feature][op][system] Add Repeat/Acc user op #3707
[feature][op][ssp] Ssp variable proxy #3715
[feature][op] multi_count_not_finite op #3879
[feature][op] model update op add skip if #3883
[feature][python] Add triplet loss #3864
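
For readers unfamiliar with some of the Ops named above, the following numpy sketch gives their standard mathematical definitions. These are reference formulas only, not OneFlow's kernels or signatures. The `beta` parameter, the highest-degree-first coefficient order in `polyval` (NumPy's convention), and the mean reduction in `mse_loss` are assumptions.

```python
import numpy as np


def swish(x, beta=1.0):
    # x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))


def mish(x):
    # x * tanh(softplus(x)), with softplus(x) = log(1 + exp(x))
    return x * np.tanh(np.log1p(np.exp(x)))


def polyval(coeffs, x):
    # Horner's rule; coefficients ordered from highest degree to constant.
    result = np.zeros_like(x, dtype=np.float64)
    for c in coeffs:
        result = result * x + c
    return result


def mse_loss(pred, target):
    # Mean of squared differences over all elements.
    return np.mean((pred - target) ** 2)


x = np.array([-1.0, 0.0, 2.0])
print(swish(x))
print(mish(x))
print(polyval([1.0, 2.0, 3.0], x))
print(mse_loss(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```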

System components

OneFlow Collective Boxing now supports NCCL All2All, and OneFlow can be built with CUDA 11.1.

[feature][system] Add Nccl All2All #3538
[WIP][bug][system] Add attribute “batch_axis_non_change” to oneflow.transpose #3685
[bug][system] fix memcopy #3687
[documentation][enhancement][system] change url link of api docs #3677
[enhancement][system] Op collection #3833
[bug][system] fix pybind11 include #3876
[enhancement][system] Dev replace str to cfg obj in python callback #3832
[enhancement][system] Dev cpp instructions builder #3829
[enhancement][system] Dev forward declare cfg #3808
[bug][system] Fix CUDA 11.1 compiler crashes #3795
[bug][system] Backport bug fixes for distributed run from multi node ci #3765
[bug][system] Fix handle remote regst #3761
[enhancement][system] Refactor ExecKernel::bn_in_op2regst_desc_id to bn_in_op2blob_info #3744
[enhancement][system] Dev scope attr value #3756
[enhancement][system] rename UserOpAttrVal to AttrValue #3752
[enhancement][system] refactor OpGraphPass to JobPass #3745
[enhancement][system] RtRegst/Regst GetBlobDesc/BlobByOrdinal #3737
[enhancement][system] Log WARNING to stderr #3713
[enhancement][system] Use cudaMemcpyDefault #3700
[enhancement][system] Migrate foreigns to pybind11 #3939
[enhancement][system] Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
[feature][system] OptimizerPlacementOptimization #3944
[feature][system] New checkpoint #3540
[enhancement][system] Sublinear memory cost by checkpointing #3976
[enhancement][system] Add gradients stats aggregation #3979
[feature][system] nccl enable mixed fusion #3981
[enhancement][system] remove serialized in python callback #3891
[bug][system] Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
[feature][system] Add NaiveB2PSubTskGphBuilder #3942
[bug][system] disable new checkpoint by default temporarily #3943
[bug][system] Explicitly specify the SBP in NonDistributedOptimizerPass #3937
[enhancement][system] Add ssp variable proxy #3859
[cfg][enhancement][system] Dev switch error proto with cfg error proto #3858
[enhancement][refactor][system] New Chain #3874
[feature][system] DynamicLossScale #3886
[bug][system] Remove CheckNoCycle in chain graph #3693
[feature][ssp][system] Memory Reuse support time shape > meta shape #3796
[feature][system] OneFlow support tensor shape max dim size up to 6 #3802
[bug][enhancement][system] Support Ampere devices #3806
[enhancement][system] Simple kernel memory bandwidth profiler #3855
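
The All2All collective mentioned at the top of this section has simple semantics that can be simulated without any GPU. The sketch below is a single-process model of the data movement only: `all_to_all` is an illustrative function, not an NCCL or OneFlow API, and the list-of-lists layout is an assumption made for readability.

```python
def all_to_all(send_buffers):
    """Simulate all-to-all among len(send_buffers) ranks.

    send_buffers[src][dst] is the chunk rank `src` sends to rank `dst`.
    Returns recv_buffers where recv_buffers[dst][src] is that same chunk.
    """
    world_size = len(send_buffers)
    for chunks in send_buffers:
        if len(chunks) != world_size:
            raise ValueError("each rank must provide one chunk per peer")
    return [
        [send_buffers[src][dst] for src in range(world_size)]
        for dst in range(world_size)
    ]


send = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
recv = all_to_all(send)
print(recv)
```

Each rank ends up with one chunk from every peer, so the operation amounts to a transpose of the chunk matrix across ranks.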

Eager mode

Fixed a series of bugs.

[bug][eager] Use universal start global device id for all streams #3701
[bug][eager] Ci add eager #3672
[bug][eager] Fix eager mode bug #3681
[eager][feature] Eager transport #3598
[eager][enhancement][python][refactor] rm scope_proto symbol_id #3865
[cfg][eager][enhancement] Replace py instruction to CFG Instruction #3773
[eager][enhancement][refactor] refactor ParallelDescSymbol #3774
[eager][feature] use proxy blob_object for boxing, add some inter-node boxing #3711
[bug][eager] fix unpacked mirrored blob object shape #3703
[bug][eager] Fix eager memory leak and re-enable new checkpoint #4008
[bug][eager] barrier for multi node eager #3748

Python frontend

[api][documentation][python] Dev add api rst #3695
[feature][python][refactor] add check in deconv #3835
[bug][enhancement][python] fix string format in py35 #3878
[bug][python] fix exception in BlobObject del #3742
[bug][python] make float/double as aliases of float32/float64 #3740
[api][bug][documentation][python] Fix placement api doc #3638
[cfg][enhancement][python] Dev replace py job conf proto to cfg #3856
[feature][python] add bceloss #3804
[enhancement][feature][python] add l1 loss op in python #3793

Toolchain

More SWIG interfaces have been replaced with pybind11.

[documentation][tooling] Add api docs zzk #3680
[documentation][tooling] Add api docs zzk #3587
[cfg][enhancement][tooling] Cfg template operator reform #3861
[cfg][enhancement][tooling] Dev use union instead of struct for oneof #3870
[cfg][enhancement][tooling] Sort cfg obj forward declare #3844
[enhancement][tooling] Dev move run instruction to pybind #3775
[bug][cfg][tooling] fix cfg module load error bug #3815
[bug][tooling] Fix oneflow worker launch in py35 #3778
[bug][cfg][tooling] Fix cfg sub proto module process bug #3729
[enhancement][tooling] Dev data onerec #3104
[cfg][enhancement][tooling] Dev compare cfg file #3717
[bug][tooling] remove proton not related to Instruction #3708
[bug][cfg][tooling] Dev switch instruction to cfg instruction #3702
[cfg][enhancement][refactor][tooling] replace ScopeProto to cfg #3816
[api][enhancement][refactor][tooling] Refine custom op build #3925
[enhancement][tooling] default show cpp error stack frame #3948
[cfg][enhancement][tooling] Dev replace py parallel conf proto to cfg #3810
[cfg][enhancement][tooling] optimize cfg generator to save time #3906
[enhancement][feature][tooling] Py kernel2 #3686

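Build

Fixed the NVCC flags; fixed CMake setting the wrong variable for the C++11 ABI under Red Hat GCC; fixed a make -j problem that could occur during the build; fixed the include directory going missing in manual builds.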

[build][documentation] fix readme #3694
[bug][build] fix missing symbol when load so #3676
[bug][build] Fix CUDA_NVCC_GENCODES #3869
[build][documentation] Add info in readme about how to build oneflow in docker #3781
[build][ci][enhancement] Add bazel_cache dir for XLA build #3766
[bug][build] fix ubuntu build relocation R_X86_64_PC32 against symbol error #3754
[build][ci][enhancement] Refactor build script #3698
[bug][build] fix make -j in grpc and openssl #3724
[bug][build] detect cxx11 abi availability in cmake #3709
[bug][build] fix include files not copied #3907

CI

Improved speed and stability, and added support for distributed environments.

[bug][ci] test use uuid log dir #3689
[ci][enhancement] Run check_license_and_format in every branch #3683
[ci][feature][test] Parallel run op cases #3670
[ci][enhancement] Run xla and pure cpu only when cuda test succeeds #3679
[ci][documentation][enhancement] add requirements.txt for api-docs #3671
[ci][enhancement] ci add label check workflow #3664
[ci][enhancement] CI merge all jobs into one #3868
[ci][enhancement] Check label every push #3863
[ci][enhancement] Update hard coded host affiliations #3847
[ci][enhancement] External PR skip oss steps #3843
[ci][enhancement] ci use pull_request ev #3842
[ci][enhancement] ci only use pull_request_target #3840
[ci][enhancement] Add pull_request_target to allow forks access secrets when CI triggered #3837
[ci][enhancement] CI run when bot is requested review #3831
[ci][enhancement] Prevent CI failure #3830
[ci][enhancement] ci dont test 2n8c #3786
[ci][enhancement] upload bin to oss #4000
[ci][enhancement][test] larger tol for bn #3965
[bug][ci] fix oss list file 100 limit #3935
[ci][enhancement] Refine release oss url #3924
[ci][enhancement] Build master whl once a day #3894
[ci][feature] Multi node support in CI #3735