Chinese word counting with IK Analyzer on the Hadoop MapReduce framework, implemented in Java:
1. Create a new ChineseWordCount class.
2. Inside this class, add a private static inner class CWCMapper that extends Mapper and overrides its map method.
PS: Mapper's four type parameters are: the input key type (usually LongWritable, the byte offset of the line), the input value type, the output key type, and the output value type.
private static class CWCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        /**
         * Watch the character encoding: this copy of Hongloumeng (Dream of the Red Chamber) is a
         * GBK-encoded txt file, so the raw bytes must be decoded as GBK.
         * Do NOT convert the Text object to a String first -- value.toString() decodes the bytes
         * as UTF-8, so "String str = value.toString(); str = new String(str.getBytes(), "gbk");"
         * does not work. Decode the bytes from value.getBytes() directly, limited to
         * value.getLength(), because the backing array may contain stale bytes beyond the valid length.
         */
        String str = new String(value.getBytes(), 0, value.getLength(), "gbk");
        Reader read = new BufferedReader(new StringReader(str));
        // second argument true selects IK Analyzer's smart (coarse-grained) segmentation mode
        IKSegmenter iks = new IKSegmenter(read, true);
        Lexeme t;
        while ((t = iks.next()) != null) {
            word.set(t.getLexemeText());
            context.write(word, one);
        }
    }
}
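To sanity-check the segmenter outside of MapReduce, the same IKSegmenter/Lexeme API used in the mapper can be run against a plain in-memory string. This is only a minimal sketch: the imports assume the IK Analyzer 2012 package layout (org.wltea.analyzer.core), and the sample sentence is made up for illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKSegmenterDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical sample sentence; in the job the text comes from the GBK-decoded input split
        String sentence = "黛玉和宝玉在大观园里说笑";
        Reader reader = new BufferedReader(new StringReader(sentence));
        IKSegmenter iks = new IKSegmenter(reader, true); // true = smart (coarse-grained) mode
        Lexeme lexeme;
        while ((lexeme = iks.next()) != null) {
            System.out.println(lexeme.getLexemeText());
        }
    }
}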
3. Likewise, add a private static inner class CWCReducer that extends Reducer and overrides its reduce method.
private static class CWCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> datas,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // key and datas arrive as e.g. (黛玉, list(1, 1, 1, ...)), so iterating over datas gives the word frequency
        int sum = 0;
        for (IntWritable data : datas) {
            sum += data.get();
        }
        // If the output file looks garbled, remember that Hadoop writes Text as UTF-8 by default,
        // so the output file must be viewed with UTF-8 encoding.
        context.write(key, new IntWritable(sum));
    }
}
4. Write the main method that configures and runs the segmentation and word-count job.
public static void main(String[] args) {
    try {
        // HadoopCfg is a helper class not shown in this post; it returns the cluster Configuration
        // (a possible version is sketched below)
        Configuration cfg = HadoopCfg.getConfigration();
        Job job = Job.getInstance(cfg);
        job.setJobName("ChineseWordCount");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(CWCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(CWCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/hongloumeng"));
        FileOutputFormat.setOutputPath(job, new Path("/hongloumengCount/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
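HadoopCfg is not included in the post; presumably it is a small helper that builds the Configuration pointing at the cluster. A minimal sketch, assuming the NameNode address is hdfs://localhost:9000 (replace with your own, or rely on core-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;

public class HadoopCfg {
    // Returns the job Configuration; the fs.defaultFS value below is an assumption,
    // adjust it to your NameNode address.
    public static Configuration getConfigration() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        return conf;
    }
}

After packaging the class into a jar and uploading the GBK txt files to /hongloumeng on HDFS, the job can be submitted with the hadoop jar command, e.g. hadoop jar cwc.jar ChineseWordCount (the jar name here is just an example).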
5. Finally, the output file (copied from HDFS to a local file, here hong-words.txt) can be post-processed in Java to take the 50 most frequent words, dropping words such as 我, 你, 他, 的.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class SortWords {
    private static final int NUM = 200;

    public static void main(String[] args) {
        try {
            // The job output is UTF-8 (see the note in the reducer), so read it with an explicit UTF-8 charset
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(new File("hong-words.txt")), "UTF-8"));
            Map<String, Integer> map = new HashMap<String, Integer>();
            String line = br.readLine();
            while (line != null && !"".equals(line)) {
                // the word and its count are separated by whitespace (a tab by default in MapReduce output)
                String[] strs = line.split("\\s+");
                if (strs[0].length() < 2) {
                    // drop single-character words such as 我, 你, 他, 的
                    line = br.readLine();
                    continue;
                } else {
                    map.put(strs[0], Integer.parseInt(strs[1]));
                    line = br.readLine();
                }
            }
            List<Map.Entry<String, Integer>> list =
                    new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
            // sort ascending by count, then walk the list from the end to get the most frequent words
            Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                    return o1.getValue().compareTo(o2.getValue());
                }
            });
            BufferedWriter bw = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(new File("hong-words2.txt")), "UTF-8"));
            for (int i = list.size() - 1; i >= Math.max(0, list.size() - NUM); i--) {
                bw.write(list.get(i).toString() + "\n");
            }
            br.close();
            bw.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
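Note that the strs[0].length() < 2 check only removes single-character words, which is why words like 我们, 什么 and 一个 still appear in the results below. If such words should also be discarded, a small stop-word set could be consulted in the same filter. This is only a sketch of one possible extension; the word list is an illustrative assumption and not part of the original code.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordFilter {
    // Hypothetical stop-word list for illustration; extend it as needed
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("我们", "你们", "他们", "什么", "一个", "这个", "那里", "这里"));

    // Returns true if the word should be kept: at least two characters and not a stop word
    public static boolean keep(String word) {
        return word.length() >= 2 && !STOP_WORDS.contains(word);
    }
}

With such a helper, the length check in SortWords would become if (!StopWordFilter.keep(strs[0])).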
Results:
宝玉=14181
笑道=8841
听了=4422
黛玉=4313
我们=4248
一个=4113
宝钗=4075
去了=4031
凤姐=3930
如今=3890
什么=3846
你们=3734
姑娘=3470
王夫人=3447
众人=3375
他们=3290
说道=3261
那里=3228
来了=3126
一面=3111
奶奶=2977
太太=2929
自己=2817
袭人=2804
老太太=2650
不知=2578
这样=2565
这个=2548
老爷=2452
只见=2434
出来=2351
两个=2296
咱们=2244
这里=2175
湘云=2169
怎么=2136
起来=2128
大家=2122
丫头=2106
只是=2087
所以=1980
也是=1945
知道=1930
姐姐=1918
姨妈=1854
告诉=1803
不是=1769
这些=1756
的人=1731
只得=1730
Closing note: this is my first time working with Hadoop and the IK Analyzer segmenter, so please point out anything that is incorrect! Next time I will post a jieba segmentation implementation in Python.
Shared: http://my.oschina.net/apdplat/blog/412921?fromerr=bQjYmVTB