一、介绍
TxtFileReader提供了读取本地文件系统数据存储的能力。在底层实现上,TxtFileReader获取本地文件数据,并转换为DataX传输协议传递给Writer。
二、配置模版
{
"setting": {},
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [
{
"reader": {
"name": "txtfilereader",
"parameter": {
"path": ["/home/haiwei.luo/case00/data"],
"encoding": "UTF-8",
"column": [
{
"index": 0,
"type": "long"
},
{
"index": 1,
"type": "boolean"
},
{
"index": 2,
"type": "double"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "date",
"format": "yyyy.MM.dd"
}
],
"fieldDelimiter": ","
}
},
"writer": {
"name": "txtfilewriter",
"parameter": {
"path": "/home/haiwei.luo/case00/result",
"fileName": "luohw",
"writeMode": "truncate",
"format": "yyyy-MM-dd"
}
}
}
]
}
}
三、使用说明
支持且仅支持读取TXT的文件,且要求TXT中shema为一张二维表。
支持类CSV格式文件,自定义分隔符。
支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量
四、实践
最近需要导一张表,原来的表数据是存放在hive上的,利用python脚本处理数据之后直接插入到hive的。现在是要将这张表的数据导入到greenplum中。表数据在7200万左右
方法:将hive数据导出成csv文件,利用datax导入到greenplum
开干:
配置json文件
{
"content":[
{
"reader":{
"name":"txtfilereader",
"parameter":{
"column":[
{
"format":"yyyy-MM-dd",
"index":0,
"type":"date"
},
{
"index":1,
"type":"string"
},
{
"index":2,
"type":"string"
},
{
"index":3,
"type":"string"
},
{
"index":4,
"type":"string"
},
{
"index":5,
"type":"long"
},
{
"index":6,
"type":"long"
},
{
"index":7,
"type":"long"
},
{
"index":8,
"type":"long"
}
],
"encoding":"utf-8",
"fieldDelimiter":",",
"path":[
"/home/tianyafu/flux_timecount_action.csv"
]
}
},
"writer":{
"name":"gpdbwriter",
"parameter":{
"column":[
"record_date",
"outid",
"tm_type",
"serv",
"app",
"down_flux",
"up_flux",
"seconds",
"count"
],
"connection":[
{
"jdbcUrl":"jdbc:postgresql://192.168.100.21:5432/ods",
"table":[
"ods_flux_timecount_action"
]
}
],
"password":"******",
"segment_reject_limit":0,
"username":"admin"
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"channel":"1"
}
}
}
然后就失败了呀
确定错误是数据中有null值,无法转换为Long类型。
查询到解决方法是添加:
nullFormat配置项
nullFormat
描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
例如如果用户配置: nullFormat:"\N",那么如果源头数据是"\N",DataX视作null字段。
必选:否
默认值:\N
那就加上呗,
{
"content":[
{
"reader":{
"name":"txtfilereader",
"parameter":{
"column":[
{
"format":"yyyy-MM-dd",
"index":0,
"type":"date"
},
{
"index":1,
"type":"string"
},
{
"index":2,
"type":"string"
},
{
"index":3,
"type":"string"
},
{
"index":4,
"type":"string"
},
{
"index":5,
"type":"long"
},
{
"index":6,
"type":"long"
},
{
"index":7,
"type":"long"
},
{
"index":8,
"type":"long"
}
],
"csvReaderConfig":{
"safetySwitch":false,
"skipEmptyRecords":false,
"useTextQualifier":false
},
"encoding":"utf-8",
"fieldDelimiter":",",
"nullFormat":"null",
"path":[
"/home/tianyafu/flux_timecount_action.csv"
]
}
},
"writer":{
"name":"gpdbwriter",
"parameter":{
"column":[
"record_date",
"outid",
"tm_type",
"serv",
"app",
"down_flux",
"up_flux",
"seconds",
"count"
],
"connection":[
{
"jdbcUrl":"jdbc:postgresql://192.168.100.21:5432/ods",
"table":[
"ods_flux_timecount_action"
]
}
],
"password":"******",
"segment_reject_limit":0,
"username":"admin"
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"channel":"1"
}
}
}
结果又失败了
看来是大小写敏感的,继续改:
{
"content":[
{
"reader":{
"name":"txtfilereader",
"parameter":{
"column":[
{
"format":"yyyy-MM-dd",
"index":0,
"type":"date"
},
{
"index":1,
"type":"string"
},
{
"index":2,
"type":"string"
},
{
"index":3,
"type":"string"
},
{
"index":4,
"type":"string"
},
{
"index":5,
"type":"long"
},
{
"index":6,
"type":"long"
},
{
"index":7,
"type":"long"
},
{
"index":8,
"type":"long"
}
],
"csvReaderConfig":{
"safetySwitch":false,
"skipEmptyRecords":false,
"useTextQualifier":false
},
"encoding":"utf-8",
"fieldDelimiter":",",
"nullFormat":"NULL",
"path":[
"/home/tianyafu/flux_timecount_action.csv"
]
}
},
"writer":{
"name":"gpdbwriter",
"parameter":{
"column":[
"record_date",
"outid",
"tm_type",
"serv",
"app",
"down_flux",
"up_flux",
"seconds",
"count"
],
"connection":[
{
"jdbcUrl":"jdbc:postgresql://192.168.100.21:5432/ods",
"table":[
"ods_flux_timecount_action"
]
}
],
"password":"******",
"segment_reject_limit":0,
"username":"admin"
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"channel":"1"
}
}
}
终于成功了
看来这个参数是大小写敏感的