Chinese word counting with IK Analyzer on the Hadoop MapReduce framework, implemented in Java:
1. Create a new ChineseWordCount class.
2. Inside this class, add a private static inner class CWCMapper that extends Mapper and overrides its map method.
PS: Mapper's four type parameters are: the input key type (usually LongWritable, the byte offset of the line), the input value type, the output key type, and the output value type.
private static class CWCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        /**
         * Watch the character encoding: this copy of Hongloumeng (Dream of the Red Chamber) is a
         * GBK-encoded txt file, so the raw bytes must be decoded as GBK.
         * Do NOT convert the Text object to a String first -- value.toString() decodes the bytes
         * as UTF-8, so "String str = value.toString(); str = new String(str.getBytes(), "gbk");"
         * does not work. Decode the bytes from value.getBytes() directly, limited to
         * value.getLength(), because the backing array may contain stale bytes beyond the valid length.
         */
        String str = new String(value.getBytes(), 0, value.getLength(), "gbk");
        Reader read = new BufferedReader(new StringReader(str));
        // second argument true selects IK Analyzer's smart (coarse-grained) segmentation mode
        IKSegmenter iks = new IKSegmenter(read, true);
        Lexeme t;
        while ((t = iks.next()) != null) {
            word.set(t.getLexemeText());
            context.write(word, one);
        }
    }
}
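To sanity-check the segmenter outside of MapReduce, the same IKSegmenter/Lexeme API used in the mapper can be run against a plain in-memory string. This is only a minimal sketch: the imports assume the IK Analyzer 2012 package layout (org.wltea.analyzer.core), and the sample sentence is made up for illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKSegmenterDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical sample sentence; in the job the text comes from the GBK-decoded input split
        String sentence = "黛玉和宝玉在大观园里说笑";
        Reader reader = new BufferedReader(new StringReader(sentence));
        IKSegmenter iks = new IKSegmenter(reader, true); // true = smart (coarse-grained) mode
        Lexeme lexeme;
        while ((lexeme = iks.next()) != null) {
            System.out.println(lexeme.getLexemeText());
        }
    }
}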
3. Likewise, add a private static inner class CWCReducer that extends Reducer and overrides its reduce method.
private static class CWCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> datas,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // key and datas arrive as e.g. (黛玉, list(1, 1, 1, ...)), so iterating over datas gives the word frequency
        int sum = 0;
        for (IntWritable data : datas) {
            sum += data.get();
        }
        // If the output file looks garbled, remember that Hadoop writes Text as UTF-8 by default,
        // so the output file must be viewed with UTF-8 encoding.
        context.write(key, new IntWritable(sum));
    }
}
4. Write the main method that configures and runs the segmentation and word-count job.
public static void main(String[] args) {
    try {
        // HadoopCfg is a helper class not shown in this post; it returns the cluster Configuration
        // (a possible version is sketched below)
        Configuration cfg = HadoopCfg.getConfigration();
        Job job = Job.getInstance(cfg);
        job.setJobName("ChineseWordCount");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(CWCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(CWCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/hongloumeng"));
        FileOutputFormat.setOutputPath(job, new Path("/hongloumengCount/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
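HadoopCfg is not included in the post; presumably it is a small helper that builds the Configuration pointing at the cluster. A minimal sketch, assuming the NameNode address is hdfs://localhost:9000 (replace with your own, or rely on core-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;

public class HadoopCfg {
    // Returns the job Configuration; the fs.defaultFS value below is an assumption,
    // adjust it to your NameNode address.
    public static Configuration getConfigration() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        return conf;
    }
}

After packaging the class into a jar and uploading the GBK txt files to /hongloumeng on HDFS, the job can be submitted with the hadoop jar command, e.g. hadoop jar cwc.jar ChineseWordCount (the jar name here is just an example).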
5. Finally, the output file (copied from HDFS to a local file, here hong-words.txt) can be post-processed in Java to take the 50 most frequent words, dropping words such as 我, 你, 他, 的.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class SortWords {
    private static final int NUM = 200;

    public static void main(String[] args) {
        try {
            // The job output is UTF-8 (see the note in the reducer), so read it with an explicit UTF-8 charset
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(new File("hong-words.txt")), "UTF-8"));
            Map<String, Integer> map = new HashMap<String, Integer>();
            String line = br.readLine();
            while (line != null && !"".equals(line)) {
                // the word and its count are separated by whitespace (a tab by default in MapReduce output)
                String[] strs = line.split("\\s+");
                if (strs[0].length() < 2) {
                    // drop single-character words such as 我, 你, 他, 的
                    line = br.readLine();
                    continue;
                } else {
                    map.put(strs[0], Integer.parseInt(strs[1]));
                    line = br.readLine();
                }
            }
            List<Map.Entry<String, Integer>> list =
                    new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
            // sort ascending by count, then walk the list from the end to get the most frequent words
            Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                    return o1.getValue().compareTo(o2.getValue());
                }
            });
            BufferedWriter bw = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(new File("hong-words2.txt")), "UTF-8"));
            for (int i = list.size() - 1; i >= Math.max(0, list.size() - NUM); i--) {
                bw.write(list.get(i).toString() + "\n");
            }
            br.close();
            bw.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
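Note that the strs[0].length() < 2 check only removes single-character words, which is why words like 我们, 什么 and 一个 still appear in the results below. If such words should also be discarded, a small stop-word set could be consulted in the same filter. This is only a sketch of one possible extension; the word list is an illustrative assumption and not part of the original code.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordFilter {
    // Hypothetical stop-word list for illustration; extend it as needed
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("我们", "你们", "他们", "什么", "一个", "这个", "那里", "这里"));

    // Returns true if the word should be kept: at least two characters and not a stop word
    public static boolean keep(String word) {
        return word.length() >= 2 && !STOP_WORDS.contains(word);
    }
}

With such a helper, the length check in SortWords would become if (!StopWordFilter.keep(strs[0])).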
Results:
宝玉=14181
笑道=8841
听了=4422
黛玉=4313
我们=4248
一个=4113
宝钗=4075
去了=4031
凤姐=3930
如今=3890
什么=3846
你们=3734
姑娘=3470
王夫人=3447
众人=3375
他们=3290
说道=3261
那里=3228
来了=3126
一面=3111
奶奶=2977
太太=2929
自己=2817
袭人=2804
老太太=2650
不知=2578
这样=2565
这个=2548
老爷=2452
只见=2434
出来=2351
两个=2296
咱们=2244
这里=2175
湘云=2169
怎么=2136
起来=2128
大家=2122
丫头=2106
只是=2087
所以=1980
也是=1945
知道=1930
姐姐=1918
姨妈=1854
告诉=1803
不是=1769
这些=1756
的人=1731
只得=1730
Closing note: this is my first time working with Hadoop and the IK Analyzer segmenter, so please point out anything that is incorrect! Next time I will post a jieba segmentation implementation in Python.
Shared: http://my.oschina.net/apdplat/blog/412921?fromerr=bQjYmVTB