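Character sets are the biggest headache for people in non-English-speaking countries, especially in China, where there is a national standard for everything. So coders here end up learning one more skill than their Western counterparts: coping with encoding problems.
With NIO we have to face encoding and decoding just the same.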
* 6. Character sets: Charset
* Encoding: string -> byte array
* Decoding: byte array -> string
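As a quick illustration of the two directions, here is a minimal sketch using the plain `String` API rather than NIO buffers. The class and method names are made up for this example, and the sample text is written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodeDecodeSketch {
    /** Encoding: string -> byte array. */
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    /** Decoding: byte array -> string. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "happyBKs" followed by three Chinese characters (U+7684 U+535A U+5BA2)
        String text = "happyBKs\u7684\u535a\u5ba2";
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        // 8 ASCII bytes + 3 characters * 3 bytes each in UTF-8
        System.out.println(utf8.length); // 17
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

Passing an explicit `Charset` on both sides avoids depending on the platform default, which is where many encoding surprises start.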
Which encodings are available?
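// Needs java.nio.charset.Charset, java.util.SortedMap, java.util.Map.Entry and (assuming JUnit 4) org.junit.Test
@Test
public void test5() {
    // Maps each canonical charset name to its Charset object, sorted by name
    SortedMap<String, Charset> availableCharsets = Charset.availableCharsets();
    for (Entry<String, Charset> entry : availableCharsets.entrySet()) {
        System.out.println(String.format("%s: %s", entry.getKey(), entry.getValue()));
    }
}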
This prints the charsets available to NIO on this JVM (the exact list depends on the JDK version and platform):
Big5: Big5
Big5-HKSCS: Big5-HKSCS
CESU-8: CESU-8
EUC-JP: EUC-JP
EUC-KR: EUC-KR
GB18030: GB18030
GB2312: GB2312
GBK: GBK
IBM-Thai: IBM-Thai
IBM00858: IBM00858
IBM01140: IBM01140
IBM01141: IBM01141
IBM01142: IBM01142
IBM01143: IBM01143
IBM01144: IBM01144
IBM01145: IBM01145
IBM01146: IBM01146
IBM01147: IBM01147
IBM01148: IBM01148
IBM01149: IBM01149
IBM037: IBM037
IBM1026: IBM1026
IBM1047: IBM1047
IBM273: IBM273
IBM277: IBM277
IBM278: IBM278
IBM280: IBM280
IBM284: IBM284
IBM285: IBM285
IBM290: IBM290
IBM297: IBM297
IBM420: IBM420
IBM424: IBM424
IBM437: IBM437
IBM500: IBM500
IBM775: IBM775
IBM850: IBM850
IBM852: IBM852
IBM855: IBM855
IBM857: IBM857
IBM860: IBM860
IBM861: IBM861
IBM862: IBM862
IBM863: IBM863
IBM864: IBM864
IBM865: IBM865
IBM866: IBM866
IBM868: IBM868
IBM869: IBM869
IBM870: IBM870
IBM871: IBM871
IBM918: IBM918
ISO-2022-CN: ISO-2022-CN
ISO-2022-JP: ISO-2022-JP
ISO-2022-JP-2: ISO-2022-JP-2
ISO-2022-KR: ISO-2022-KR
ISO-8859-1: ISO-8859-1
ISO-8859-13: ISO-8859-13
ISO-8859-15: ISO-8859-15
ISO-8859-2: ISO-8859-2
ISO-8859-3: ISO-8859-3
ISO-8859-4: ISO-8859-4
ISO-8859-5: ISO-8859-5
ISO-8859-6: ISO-8859-6
ISO-8859-7: ISO-8859-7
ISO-8859-8: ISO-8859-8
ISO-8859-9: ISO-8859-9
JIS_X0201: JIS_X0201
JIS_X0212-1990: JIS_X0212-1990
KOI8-R: KOI8-R
KOI8-U: KOI8-U
Shift_JIS: Shift_JIS
TIS-620: TIS-620
US-ASCII: US-ASCII
UTF-16: UTF-16
UTF-16BE: UTF-16BE
UTF-16LE: UTF-16LE
UTF-32: UTF-32
UTF-32BE: UTF-32BE
UTF-32LE: UTF-32LE
UTF-8: UTF-8
windows-1250: windows-1250
windows-1251: windows-1251
windows-1252: windows-1252
windows-1253: windows-1253
windows-1254: windows-1254
windows-1255: windows-1255
windows-1256: windows-1256
windows-1257: windows-1257
windows-1258: windows-1258
windows-31j: windows-31j
x-Big5-HKSCS-2001: x-Big5-HKSCS-2001
x-Big5-Solaris: x-Big5-Solaris
x-euc-jp-linux: x-euc-jp-linux
x-EUC-TW: x-EUC-TW
x-eucJP-Open: x-eucJP-Open
x-IBM1006: x-IBM1006
x-IBM1025: x-IBM1025
x-IBM1046: x-IBM1046
x-IBM1097: x-IBM1097
x-IBM1098: x-IBM1098
x-IBM1112: x-IBM1112
x-IBM1122: x-IBM1122
x-IBM1123: x-IBM1123
x-IBM1124: x-IBM1124
x-IBM1166: x-IBM1166
x-IBM1364: x-IBM1364
x-IBM1381: x-IBM1381
x-IBM1383: x-IBM1383
x-IBM300: x-IBM300
x-IBM33722: x-IBM33722
x-IBM737: x-IBM737
x-IBM833: x-IBM833
x-IBM834: x-IBM834
x-IBM856: x-IBM856
x-IBM874: x-IBM874
x-IBM875: x-IBM875
x-IBM921: x-IBM921
x-IBM922: x-IBM922
x-IBM930: x-IBM930
x-IBM933: x-IBM933
x-IBM935: x-IBM935
x-IBM937: x-IBM937
x-IBM939: x-IBM939
x-IBM942: x-IBM942
x-IBM942C: x-IBM942C
x-IBM943: x-IBM943
x-IBM943C: x-IBM943C
x-IBM948: x-IBM948
x-IBM949: x-IBM949
x-IBM949C: x-IBM949C
x-IBM950: x-IBM950
x-IBM964: x-IBM964
x-IBM970: x-IBM970
x-ISCII91: x-ISCII91
x-ISO-2022-CN-CNS: x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB: x-ISO-2022-CN-GB
x-iso-8859-11: x-iso-8859-11
x-JIS0208: x-JIS0208
x-JISAutoDetect: x-JISAutoDetect
x-Johab: x-Johab
x-MacArabic: x-MacArabic
x-MacCentralEurope: x-MacCentralEurope
x-MacCroatian: x-MacCroatian
x-MacCyrillic: x-MacCyrillic
x-MacDingbat: x-MacDingbat
x-MacGreek: x-MacGreek
x-MacHebrew: x-MacHebrew
x-MacIceland: x-MacIceland
x-MacRoman: x-MacRoman
x-MacRomania: x-MacRomania
x-MacSymbol: x-MacSymbol
x-MacThai: x-MacThai
x-MacTurkish: x-MacTurkish
x-MacUkraine: x-MacUkraine
x-MS932_0213: x-MS932_0213
x-MS950-HKSCS: x-MS950-HKSCS
x-MS950-HKSCS-XP: x-MS950-HKSCS-XP
x-mswin-936: x-mswin-936
x-PCK: x-PCK
x-SJIS_0213: x-SJIS_0213
x-UTF-16LE-BOM: x-UTF-16LE-BOM
X-UTF-32BE-BOM: X-UTF-32BE-BOM
X-UTF-32LE-BOM: X-UTF-32LE-BOM
x-windows-50220: x-windows-50220
x-windows-50221: x-windows-50221
x-windows-874: x-windows-874
x-windows-949: x-windows-949
x-windows-950: x-windows-950
x-windows-iso2022jp: x-windows-iso2022jp
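Because this list differs between JDK builds, it can be worth checking for a charset before relying on it. The following is a small sketch, not part of the original test class; `CharsetLookupSketch` and `canonicalNameOrNull` are names invented here.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
final class CharsetLookupSketch {
    /**
     * Returns the canonical name for a charset name or alias,
     * or null if this JVM does not provide that charset.
     * Charset.isSupported throws IllegalCharsetNameException for syntactically illegal names.
     */
    static String canonicalNameOrNull(String name) {
        if (!Charset.isSupported(name)) {
            return null;
        }
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalNameOrNull("utf8"));            // alias resolves to UTF-8
        System.out.println(canonicalNameOrNull("GBK"));             // GBK, or null on a JVM without it
        System.out.println(canonicalNameOrNull("no-such-charset")); // null
        System.out.println(Charset.defaultCharset());               // platform dependent
    }
}
```

Calling `Charset.forName` directly on an unsupported name throws `UnsupportedCharsetException`, so the `isSupported` check is the gentler way to probe.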
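How to encode and decode
Use Charset.forName(String) to obtain a Charset, then call newEncoder() and newDecoder() on it to create an encoder and a decoder. The encoder turns a CharBuffer into a ByteBuffer, and the decoder turns a ByteBuffer back into a CharBuffer.
Note, however, that both operations read from the buffer's current position up to its limit. After put()-ing text into a CharBuffer, remember to flip() it into read mode before encoding; otherwise the encoder sees only the unused remainder of the buffer, not your text. Likewise, if a ByteBuffer has already been read to its end, move its position back to 0 (flip() or rewind()) before decoding, or the decoder gets nothing at all.
If the bytes are decoded with a different charset from the one used to encode them, the result is garbled text (mojibake).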
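The effect of a missing flip() or rewind() is easy to see in isolation. The sketch below is an illustration with invented names (`FlipBeforeEncodeSketch` and its methods); it uses UTF-8 and the convenience methods `Charset.encode` and `Charset.decode` instead of an explicit encoder and decoder, purely to keep it short.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class FlipBeforeEncodeSketch {
    /** Puts text (at most 16 chars) into a 16-char buffer and encodes it, with or without flip(). */
    static int encodedLength(String text, boolean flipFirst) {
        CharBuffer chars = CharBuffer.allocate(16);
        chars.put(text); // position is now just past the text
        if (flipFirst) {
            chars.flip(); // limit = end of text, position = 0
        }
        // Encodes whatever lies between position and limit
        return StandardCharsets.UTF_8.encode(chars).remaining();
    }

    /** Reads an encoded buffer to its end, then decodes it, with or without rewind(). */
    static String decodeAfterReading(String text, boolean rewindFirst) {
        ByteBuffer bytes = StandardCharsets.UTF_8.encode(text);
        while (bytes.hasRemaining()) {
            bytes.get(); // consume every byte, leaving position == limit
        }
        if (rewindFirst) {
            bytes.rewind(); // position = 0, limit unchanged
        }
        return StandardCharsets.UTF_8.decode(bytes).toString();
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("happy", true));  // 5: the text itself
        System.out.println(encodedLength("happy", false)); // 11: the unused NUL chars after the text
        System.out.println("[" + decodeAfterReading("happy", true) + "]");  // [happy]
        System.out.println("[" + decodeAfterReading("happy", false) + "]"); // []
    }
}
```

Note that the ByteBuffer returned by an encoder is already positioned at 0; it only needs resetting once something has read from it. With that in mind, here is the complete GBK round trip: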
@Test
public void test6() throws CharacterCodingException {
    Charset charset1 = Charset.forName("GBK");
    // Get an encoder
    CharsetEncoder encoder = charset1.newEncoder();
    // Get a decoder
    CharsetDecoder decoder = charset1.newDecoder();
    CharBuffer charBuffer = CharBuffer.allocate(1024);
    charBuffer.put("happyBKs的博客");
    // Encode
    charBuffer.flip(); // encoding reads from charBuffer, so switch it to read mode first
    ByteBuffer byteBuffer = encoder.encode(charBuffer);
    // byteBuffer.limit() is 14: in GBK each English letter takes 1 byte and each Chinese character 2 bytes
    for (int i = 0; i < byteBuffer.limit(); i++) {
        System.out.println(byteBuffer.get());
    }
    // Decode
    byteBuffer.rewind(); // the loop above consumed every byte; go back to position 0 or the next line prints an empty string
    CharBuffer charBufferDecoded = decoder.decode(byteBuffer);
    System.out.println(charBufferDecoded.toString());
    System.out.println("---------------------------------------------------");
    Charset utf8Charset = Charset.forName("UTF-8");
    byteBuffer.rewind(); // byteBuffer has just been read to its end; rewind to read it again from the start
    CharBuffer charBufferDecodedByUTF8 = utf8Charset.decode(byteBuffer);
    System.out.println(charBufferDecodedByUTF8.toString());
    System.out.println("---------------------------------------------------");
    Charset gbkCharset = Charset.forName("GBK");
    byteBuffer.rewind(); // byteBuffer has just been read to its end; rewind to read it again from the start
    CharBuffer charBufferDecodedByGBK = gbkCharset.decode(byteBuffer);
    System.out.println(charBufferDecodedByGBK.toString());
}
Output:
104
97
112
112
121
66
75
115
-75
-60
-78
-87
-65
-51
happyBKs的博客
---------------------------------------------------
happyBKs�IJ���
---------------------------------------------------
happyBKs的博客
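The garbled middle line appears because `Charset.decode` silently substitutes the replacement character U+FFFD for byte sequences that are not valid in the target charset. A decoder obtained from `newDecoder()` reports such input by throwing instead, which can be used to detect a mismatch early. The sketch below is one way to do that; the class and method names are invented here, and the byte array is copied from the GBK output above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class StrictDecodeSketch {
    // GBK bytes of the example string, taken from the printed output above
    static final byte[] GBK_BYTES = {104, 97, 112, 112, 121, 66, 75, 115, -75, -60, -78, -87, -65, -51};

    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // REPORT is also the default
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed input: these bytes were not produced by a UTF-8 encoder
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(GBK_BYTES)); // false
        // Lenient decoding does not fail; it inserts U+FFFD instead
        String lenient = new String(GBK_BYTES, StandardCharsets.UTF_8);
        System.out.println(lenient.contains("\uFFFD")); // true
    }
}
```

A successful strict decode does not prove the charset is the right one (many byte sequences are valid in several charsets), but a failure is a reliable sign that it is the wrong one.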
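So when NIO code that works with the file system misbehaves later on, the likely causes can be grouped like this:
If the result is empty, the buffer was most likely not switched to read mode (or rewound) before encoding or decoding.
If the result is garbled, the encoder and decoder most likely use different charsets; it may also be that only part of the data had arrived in the buffer, so a multi-byte character was cut in half.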
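The truncated-character case can be handled by keeping one decoder across reads and carrying any undecoded tail bytes over to the next read. The following is a minimal sketch under a few assumptions: the data is UTF-8, it arrives as fixed-size chunks of a byte array (standing in for successive channel reads), and all names (`ChunkedDecodeSketch`, `decodeStreaming`, `decodeEachChunkAlone`) are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only. chunkSize must be positive.
final class ChunkedDecodeSketch {
    /** Naive: decodes every chunk on its own, so a character split across chunks is damaged. */
    static String decodeEachChunkAlone(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Streaming: one decoder for all chunks; undecoded tail bytes stay in the buffer for the next round. */
    static String decodeStreaming(byte[] data, int chunkSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4); // one chunk plus a carried-over partial character
        CharBuffer out = CharBuffer.allocate(data.length);  // UTF-8 never yields more chars than bytes
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            in.put(data, off, len);
            in.flip();                              // read mode for the decoder
            check(decoder.decode(in, out, false));  // false: more input may follow
            in.compact();                           // keep unread tail bytes, back to write mode
        }
        in.flip();
        check(decoder.decode(in, out, true));       // true: no more input, leftovers are an error
        check(decoder.flush(out));
        out.flip();
        return out.toString();
    }

    private static void check(CoderResult result) {
        if (result.isError()) {
            throw new IllegalArgumentException("input is not valid UTF-8: " + result);
        }
    }

    public static void main(String[] args) {
        // "happyBKs" followed by three Chinese characters (U+7684 U+535A U+5BA2)
        String text = "happyBKs\u7684\u535a\u5ba2";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeStreaming(data, 4).equals(text));      // true
        System.out.println(decodeEachChunkAlone(data, 4).equals(text)); // false: split characters become U+FFFD
    }
}
```

The key calls are `decode(in, out, false)`, which leaves an incomplete trailing sequence unread instead of treating it as an error, and `compact()`, which moves those unread bytes to the front of the buffer so the next chunk is appended after them.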