Changelog
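OneFlow has released version 0.3.2. Both this release and the earlier 0.3.1 are minor releases of the 0.3.0 major version, so they are covered together here.
This release brings a large number of performance optimizations and many new features, and is among the first to support CUDA 11.1.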
Overview of major new features
Support for sublinear memory optimization
Enable it with oneflow.experimental.scope.config(checkpointing=self.checkpoint_activations); it greatly reduces memory usage. For example:

    def transformer_layer(self, name, x, *, past):
        # ...
        with flow.scope.namespace(name):
            x = flow.identity(x)
            # Ops created inside this scope are checkpointed when the flag is True.
            with flow.experimental.scope.config(
                checkpointing=self.checkpoint_activations
            ):
                norm1 = norm(x, name="layernorm_1")
                # ...
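To illustrate why checkpointing saves memory, here is a minimal pure-Python sketch of the general technique on a chain of scalar functions. It is not OneFlow's implementation; LAYERS, grad_store_all, grad_checkpointed, and the segment parameter are hypothetical names made up for this example. The checkpointed version keeps one input per segment during the forward pass and recomputes the remaining inputs segment by segment during the backward pass.

```python
import math

# Each layer is a (forward, derivative) pair of scalar functions.
LAYERS = [
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: 2.0 * x, lambda x: 2.0),
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
] * 4  # 16 layers in total


def grad_store_all(x):
    """Baseline: keep every layer input, then backpropagate."""
    inputs = []
    for fwd, _ in LAYERS:
        inputs.append(x)
        x = fwd(x)
    grad = 1.0
    for (_, dfwd), inp in zip(reversed(LAYERS), reversed(inputs)):
        grad *= dfwd(inp)
    return grad, len(inputs)


def grad_checkpointed(x, segment=4):
    """Keep one input per segment; recompute the others during backward."""
    checkpoints = {}
    for i, (fwd, _) in enumerate(LAYERS):
        if i % segment == 0:
            checkpoints[i] = x
        x = fwd(x)
    grad = 1.0
    peak_live = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg = LAYERS[start:start + segment]
        # Recompute this segment's inputs from its checkpoint.
        inputs, h = [], checkpoints[start]
        for fwd, _ in seg:
            inputs.append(h)
            h = fwd(h)
        # Upper bound on values held at once: all checkpoints plus one segment.
        peak_live = max(peak_live, len(checkpoints) + len(inputs))
        for (_, dfwd), inp in zip(reversed(seg), reversed(inputs)):
            grad *= dfwd(inp)
    return grad, peak_live
```

With 16 layers and a segment length of 4, the baseline holds 16 stored inputs while the checkpointed version holds at most 8, at the cost of one extra forward pass per segment. Both return the same gradient.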
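New checkpoint
The new checkpoint is much more flexible. It supports partial loading and saving, reading the values of weights (for printing and similar uses), and assigning values to weights from numpy arrays.

    import tempfile

    import numpy as np
    import oneflow as flow

    # refresh_session, get_checkpoint_ready_model, model_getter,
    # get_add_and_reduce_mean_model and dtype are helpers that are not
    # defined in this excerpt.

    # Save all variables to a directory, then load them into a fresh session.
    with tempfile.TemporaryDirectory() as save_dir:
        refresh_session()
        large1 = get_checkpoint_ready_model(model_getter, dtype)
        flow.checkpoint.save(save_dir)
        res1 = large1()

        refresh_session()
        large2 = get_checkpoint_ready_model(model_getter, dtype)
        vars_in_file = flow.checkpoint.get(save_dir)
        flow.load_variables(vars_in_file)
        res2 = large2()

    # Partially load: assign a numpy array to "x" only and leave "y" untouched.
    refresh_session()
    model = get_checkpoint_ready_model(get_add_and_reduce_mean_model, dtype)
    var_x = flow.get_all_variables()["x"]
    var_y_value_before_loading = flow.get_all_variables()["y"].numpy()
    new_val_np = np.random.random(var_x.shape).astype(np.float32)
    flow.load_variables({"x": new_val_np})
    var_y_value_after_loading = flow.get_all_variables()["y"].numpy()
    flow_res = model()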
Support for dynamic loss scale schedule
To enable it:

    loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
    optimizer = flow.optimizer.AdamW(..., loss_scale_policy=loss_scale_policy)
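As a rough illustration of what a dynamic loss scale schedule typically does, the sketch below grows the scale after increment_period consecutive steps with finite gradients and shrinks it (skipping the update) when a non-finite gradient appears. This is a generic pure-Python sketch, not OneFlow's implementation: the DynamicLossScale class, the initial_scale and multiplier parameters, and their default values are assumptions made for this example. Only increment_period comes from the snippet above.

```python
import math


class DynamicLossScale:
    """Toy dynamic loss scaler (hypothetical, not OneFlow's implementation)."""

    def __init__(self, initial_scale=2.0 ** 16, increment_period=2000, multiplier=2.0):
        self.scale = initial_scale
        self.increment_period = increment_period
        self.multiplier = multiplier  # assumed growth/shrink factor
        self.good_steps = 0

    def update(self, grads):
        """Return True if the step should be applied, False if it is skipped."""
        if all(math.isfinite(g) for g in grads):
            self.good_steps += 1
            if self.good_steps >= self.increment_period:
                # Enough consecutive finite steps: try a larger scale.
                self.scale *= self.multiplier
                self.good_steps = 0
            return True
        # Overflow detected: shrink the scale and skip this update.
        self.scale /= self.multiplier
        self.good_steps = 0
        return False
```

In a training loop, the loss would be multiplied by scaler.scale before backpropagation, and the gradients divided by the same value before the optimizer step whenever update returns True.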
Support for the latest CUDA 11.1
Install it with the following command:
python3 -m pip install --find-links https://release.oneflow.info oneflow_cu111 --user
Prebuilt packages with the XLA tensor compiler (supporting CUDA 10, 10.1, 10.2, and 11.0)
Install one with the following command:
python3 -m pip install --find-links https://release.oneflow.info oneflow_cu101_xla --user
Major improvements and bug fixes
Changelog v0.3.0 ~ v0.3.2 (16/12/2020)
Op fixes and optimizations
Optimized ops and op combinations such as scalar mul by tensor, cast scale, prelu, and fused_scale_tril.
- [enhancement][op] Dev sx xla clip #3656
- [enhancement][op] Add UserOp::InferSbpSignature #3699
- [bug][op] Fix fuse scalar mul by tensor sbp #3692
- [bug][op] fix softmax condition #3675
- [enhancement][op] slice_update op #3544
- [enhancement][op] optimize rmsprop and lars optimizers #3809
- [enhancement][op] add oneflow_range #3725
- [enhancement][op] torch.gather #3602
- [bug][op] skip conv2d padding dynamic test case #3813
- [bug][op] Fix __hne in BinaryFuncFloorMod #3788
- [bug][op] Fix bn[_add]_relu test case #3767
- [enhancement][op][system] Make class Tensor abstract #3757
- [enhancement][op] Add user_op::KernelCreateContext #3739
- [bug][op] fix warning #3732
- [api][enhancement][op] User op registry attr #3716
- [enhancement][op][refactor] Dev refactor user op registry attr #3714
- [bug][op] fix argwhere format #4010
- [enhancement][op] Argwhere support empty blob #4009
- [enhancement][op] Fuse cast scale #3999
- [enhancement][op] layer_norm_grad_add_to_output #3998
- [enhancement][op] Dev optimize prelu #3987
- [api][enhancement][op] Switch identity to user op and add it to auto mixed precision clear list #3992
- [enhancement][op] Optimize slice kernel #3989
- [bug][op] Hotfix: add parallel cast to amp clear list #3988
- [enhancement][op] fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
- [bug][op] add combined margin cpu and fix bug #3961
- [bug][op] fix pad op #3971
- [bug][op] Fix constant init value #3947
- [bug][op] indexed_slices_model_update handle empty tensor #3933
- [bug][op] fix distribute_clone sbp #3803
- [bug][op] Reshape backward issue with distribute split #3915
- [enhancement][op] Remove NormalModelUpdateOpConf #3917
- [enhancement][op] Dev unsorted segment sum #3731
- [bug][op] Dev split like add backward #3901
- [bug][op] distribute concat out dynamic false #3899
- [enhancement][op] UserOpWrapper add HasGradTensor4OpOutput #3904
- [enhancement][op] Unpack/Pack user op #3727
- [enhancement][op] adam_bias_correction_learning_rate #3763
- [enhancement][op][serving] add flatten op implementation #3789
- [enhancement][op] Dev enhance sort ops #3828
- [enhancement][op] Optimize softmax cuda kernel block size #3853
- [enhancement][op] SplitLikeOp prefix support #3866
- [bug][op] fix gather set_is_dynamic #3900
- [bug][op] fix unsorted segment sum like #3898
New ops and new features for existing ops
Added ops such as polyval, swish, mish, multi_square_sum, mseloss, lamb, and triplet loss.
- [enhancement][op] Add polyval op #3541
- [feature][op] Add broadcast like backward #3665
- [feature][op] Add cuda_pseudo_half.h #3669
- [feature][op][python] add swish activation #3970
- [feature][op][python] add mish activation #3972
- [feature][op] Add multi_square_sum op #3977
- [feature][op] TripOp add fill value #3960
- [feature][op] add combined margin loss #3819
- [feature][op] dynamic loss scale schedule op #3885
- [feature][op][python] add mseloss #3893
- [feature][op] LAMB support #3620
- [feature][op] logical slice_assign and slice op #3647
- [feature][op][system] Add Repeat/Acc user op #3707
- [feature][op][ssp] Ssp variable proxy #3715
- [feature][op] multi_count_not_finite op #3879
- [feature][op] model update op add skip if #3883
- [feature][python] Add triplet loss #3864
System components
OneFlow Collective Boxing now supports NCCL All2All, and building with CUDA 11.1 is supported.
- [feature][system] Add Nccl All2All #3538
- [WIP][bug][system] Add attribute "batch_axis_non_change" to oneflow.transpose #3685
- [bug][system] fix memcopy #3687
- [documentation][enhancement][system] change url link of api docs #3677
- [enhancement][system] Op collection #3833
- [bug][system] fix pybind11 include #3876
- [enhancement][system] Dev replace str to cfg obj in python callback #3832
- [enhancement][system] Dev cpp instructions builder #3829
- [enhancement][system] Dev forward declare cfg #3808
- [bug][system] Fix CUDA 11.1 compiler crashes #3795
- [bug][system] Backport bug fixes for distributed run from multi node ci #3765
- [bug][system] Fix handle remote regst #3761
- [enhancement][system] Refactor ExecKernel::bn_in_op2regst_desc_id to bn_in_op2blob_info #3744
- [enhancement][system] Dev scope attr value #3756
- [enhancement][system] rename UserOpAttrVal to AttrValue #3752
- [enhancement][system] refactor OpGraphPass to JobPass #3745
- [enhancement][system] RtRegst/Regst GetBlobDesc/BlobByOrdinal #3737
- [enhancement][system] Log WARNING to stderr #3713
- [enhancement][system] Use cudaMemcpyDefault #3700
- [enhancement][system] Migrate foreigns to pybind11 #3939
- [enhancement][system] Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
- [feature][system] OptimizerPlacementOptimization #3944
- [feature][system] New checkpoint #3540
- [enhancement][system] Sublinear memory cost by checkpointing #3976
- [enhancement][system] Add gradients stats aggregation #3979
- [feature][system] nccl enable mixed fusion #3981
- [enhancement][system] remove serialized in python callback #3891
- [bug][system] Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
- [feature][system] Add NaiveB2PSubTskGphBuilder #3942
- [bug][system] disable new checkpoint by default temporarily #3943
- [bug][system] Explicitly specify the SBP in NonDistributedOptimizerPass #3937
- [enhancement][system] Add ssp variable proxy #3859
- [cfg][enhancement][system] Dev switch error proto with cfg error proto #3858
- [enhancement][refactor][system] New Chain #3874
- [feature][system] DynamicLossScale #3886
- [bug][system] Remove CheckNoCycle in chain graph #3693
- [feature][ssp][system] Memory Reuse support time shape > meta shape #3796
- [feature][system] OneFlow support tensor shape max dim size up to 6 #3802
- [bug][enhancement][system] Support Ampere devices #3806
- [enhancement][system] Simple kernel memory bandwidth profiler #3855
Eager mode
Fixed a series of bugs.
- [bug][eager] Use universal start global device id for all streams #3701
- [bug][eager] Ci add eager #3672
- [bug][eager] Fix eager mode bug #3681
- [eager][feature] Eager transport #3598
- [eager][enhancement][python][refactor] rm scope_proto symbol_id #3865
- [cfg][eager][enhancement] Replace py instruction to CFG Instruction #3773
- [eager][enhancement][refactor] refactor ParallelDescSymbol #3774
- [eager][feature] use proxy blob_object for boxing, add some inter-node boxing #3711
- [bug][eager] fix unpacked mirrored blob object shape #3703
- [bug][eager] Fix eager memory leak and re-enable new checkpoint #4008
- [bug][eager] barrier for multi node eager #3748
Python frontend
- [api][documentation][python] Dev add api rst #3695
- [feature][python][refactor] add check in deconv #3835
- [bug][enhancement][python] fix string format in py35 #3878
- [bug][python] fix exception in BlobObject del #3742
- [bug][python] make float/double as aliases of float32/float64 #3740
- [api][bug][documentation][python] Fix placement api doc #3638
- [cfg][enhancement][python] Dev replace py job conf proto to cfg #3856
- [feature][python] add bceloss #3804
- [enhancement][feature][python] add l1 loss op in python #3793
Toolchain
More SWIG interfaces have been replaced with pybind11.
- [documentation][tooling] Add api docs zzk #3680
- [documentation][tooling] Add api docs zzk #3587
- [cfg][enhancement][tooling] Cfg template operator reform #3861
- [cfg][enhancement][tooling] Dev use union instead of struct for oneof #3870
- [cfg][enhancement][tooling] Sort cfg obj forward declare #3844
- [enhancement][tooling] Dev move run instruction to pybind #3775
- [bug][cfg][tooling] fix cfg module load error bug #3815
- [bug][tooling] Fix oneflow worker launch in py35 #3778
- [bug][cfg][tooling] Fix cfg sub proto module process bug #3729
- [enhancement][tooling] Dev data onerec #3104
- [cfg][enhancement][tooling] Dev compare cfg file #3717
- [bug][tooling] remove proton not related to Instruction #3708
- [bug][cfg][tooling] Dev switch instruction to cfg instruction #3702
- [cfg][enhancement][refactor][tooling] replace ScopeProto to cfg #3816
- [api][enhancement][refactor][tooling] Refine custom op build #3925
- [enhancement][tooling] default show cpp error stack frame #3948
- [cfg][enhancement][tooling] Dev replace py parallel conf proto to cfg #3810
- [cfg][enhancement][tooling] optimize cfg generator to save time #3906
- [enhancement][feature][tooling] Py kernel2 #3686
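Build
Fixed NVCC arguments, fixed CMake setting the wrong environment variable for the C++11 ABI under RedHat GCC, fixed make -j problems that could occur during the build, and fixed the include directory going missing in manual builds.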
- [build][documentation] fix readme #3694
- [bug][build] fix missing symbol when load so #3676
- [bug][build] Fix CUDA_NVCC_GENCODES #3869
- [build][documentation] Add info in readme about how to build oneflow in docker #3781
- [build][ci][enhancement] Add bazel_cache dir for XLA build #3766
- [bug][build] fix ubuntu build relocation R_X86_64_PC32 against symbol error #3754
- [build][ci][enhancement] Refactor build script #3698
- [bug][build] fix make -j in grpc and openssl #3724
- [bug][build] detect cxx11 abi availability in cmake #3709
- [bug][build] fix include files not copied #3907
CI
Improved run speed and stability, and added support for distributed environments.
- [bug][ci] test use uuid log dir #3689
- [ci][enhancement] Run check_license_and_format in every branch #3683
- [ci][feature][test] Parallel run op cases #3670
- [ci][enhancement] Run xla and pure cpu only when cuda test succeeds #3679
- [ci][documentation][enhancement] add requirements.txt for api-docs #3671
- [ci][enhancement] ci add label check workflow #3664
- [ci][enhancement] CI merge all jobs into one #3868
- [ci][enhancement] Check label every push #3863
- [ci][enhancement] Update hard coded host affiliations #3847
- [ci][enhancement] External PR skip oss steps #3843
- [ci][enhancement] ci use pull_request ev #3842
- [ci][enhancement] ci only use pull_request_target #3840
- [ci][enhancement] Add pull_request_target to allow forks access secrets when CI triggered #3837
- [ci][enhancement] CI run when bot is requested review #3831
- [ci][enhancement] Prevent CI failure #3830
- [ci][enhancement] ci dont test 2n8c #3786
- [ci][enhancement] upload bin to oss #4000
- [ci][enhancement][test] larger tol for bn #3965
- [bug][ci] fix oss list file 100 limit #3935
- [ci][enhancement] Refine release oss url #3924
- [ci][enhancement] Build master whl once a day #3894
- [ci][feature] Multi node support in CI #3735
Test
Fixed the image resize test case.