# 使用正则统计英文文章中的高频词汇

# 1 需求分析

某英语老师需要统计完型填空、阅读理解等题型中出现的高频词汇，将这些词汇整理后让学生加强学习与记忆。

（1）只统计英文词汇，文章中的中文需要剔除；

（2）统计出高频的词汇及出现次数；

（3）按出现频率从高到低的顺序给出有限的结果。

# 2 处理过程

（1）从文本文件读入文章，放到内存中等待处理；

（2）使用正则匹配中英文单词，不包含中文和其它特殊字符；

（3）将匹配出的单词存起来，并统计出现次数；

（4）按出现次数从高到低进行排序；

（5）取前N个结果；

# 3 代码实现

public class WordCounter {
    //储存次数的Map
    private static Map<String, Integer> countMap = Maps.newHashMap();
    //正则匹配
    private static String regex = "[a-zA-Z]+";
    private static Pattern pattern = Pattern.compile(regex);
    //总个数
    private static int sumCount = 0;

    public static void main(String[] args) throws IOException {
        //文件：包含要处理的文章
        String filePath = "C:\\input2.txt";
        //使用Gavua的库读取文件，返回所有行
        List<String> lines = Files.readLines(new File(filePath), Charset.forName("UTF-8"));
        //遍历处理所有行
        lines.forEach(WordCounter::processLine);

        //排序，取前20个结果
        countMap.entrySet().stream()
                .sorted(Comparator.comparing(Map.Entry::getValue, Comparator.reverseOrder()))
                .limit(20)
                .forEach(
                        entry -> System.out.println(entry.getKey() + "\t" + entry.getValue())
                );
        // 单词总数
        System.out.println("Total:" + sumCount);
    }


    //处理一行字符串
    private static void processLine(String line) {
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            sumCount++;
            String word = matcher.group();
            //取单词小写，大小写不区分统计
            word = word.toLowerCase();
            //如果Map中没有，则表示第一次出现；有则次数加1
            if (null == countMap.get(word)) {
                countMap.put(word, 1);
            } else {
                int count = countMap.get(word);
                countMap.put(word, count + 1);
            }
        }
    }
}

# 4 结果展示

本次例子选取了三篇演讲，并且文章中有中文和英文：

Emma Watson: Gender equality is your issue too

Martin Luther King: I have a dream

Obama: This is your victory

统计后前20个单词如下：

# 5 总结

本文实例主要实现了读取文件，并处理英文词频的问题。仍然有很多问题需要考虑，如考题会出现很多单独字母A、B、C、D，则需要正则把它们剔除；如果文件太大，则不能load入内存再处理；扩展成多线程来处理等。

← 统计String单词数的三种方法 String.intern()原来还能这么用（原理与应用） →