C++回顾统计词频问题

系统教程610 更新时间：2026-04-04 06:04:54

2023年11月29日发(作者：)

C++回顾统计词频问题--vector、map、hash_map（三种⽅式时

间⽐较）

本博⽂我们通过三个程序⽐较统计词频问题的时间复杂度问题（末尾有⽤时及其分析）；

问题描述;

1）、找⼀篇⽂章，将所有单词输⼊⾄程序；（The Bible Holy为例）

2）、统计出每个单词的数量，即词频问题；

3）、增加停⽤词功能；（遇到此类词，直接略过）（⽹上搜）

4）、分别统计出读取⽂件并计算词频时间、排序所⽤时间；

5）、⽤类实现各函数（处统计时间的函数除外）。

vector、map、hash_map 都要处理字符串的去除标点符号、将⼤写字母转换成⼩写字母、不对数字进⾏统计问题。因此，我们可以将处

理这些函数进⾏封装。

StringUtil.h

1 #ifndef STRING_UTIL_H_

2 #define STRING_UTIL_H_

3

4 #include <string>

5

6 namespace stringutil

7 {

8 void erasePunct(std::string &s);

9 void stringToLower(std::string &s);

10 bool isAllDigit(const std::string &s);

11 }

12

13 #endif /* STRING_UTIL_H_ */

1 #include "StringUtil.h"

2 #include

3

4 using namespace std;

5

6 namespace stringutil

7 {

8

9 void erasePunct(string &s)

10 {

11 string::iterator it = ();

12 while(it != ())

13 {

14 if(ispunct(*it))

15 it = (it);

16 else

17 ++it;

18 }

19 }

20 void stringToLower(string &s)

21 {

22 for(string::iterator it = ();

23 it != ();

24 ++it)

25 {

26 if(isupper(*it))

27 *it = tolower(*it);

28 }

29 }

30

31 bool isAllDigit(const std::string &s)

32 {

33 for(string::const_iterator it = ();

34 it != ();

35 ++it)

36 {

42 }

43

44 }

⼀、采⽤vector、pair；

采⽤的数据结构：

1）、vector stopList⽤于存放停⽤词；

2）、vector > words ⽤于存放除禁⽤词之外的其他The Bible Holy 的单词；

思路：

1）、⽤ReadStopFile函数读取停⽤词到 stopList中；

2）、⽤ReadWordFile函数读取The Bible Holy到 words中；在我们读取的过程中⾸先要去除标点符号、将word转化为⼩写、跳过数字；

3）、然后判断该单词是否在 stoplist 中，若不在，最后添加单词⾄ words中。

4）、对词频排序，这⾥我们采⽤algorithm库中的sort函数；

5）、对words vector进⾏遍历，打印之；

这⾥注意⼀点：除了在中显式调⽤的函数，我们将其他函数声明为类的私有函数。

函数声明如下：

1 #ifndef WORDFREQUENCY_H_

2 #define WORDFREQUENCY_H_

3

4 #include

5 #include <string>

6 #include

7

8 class WordFrequency

9 {

10 public:

11 WordFrequency(const std::string &filename, const std::string &stopFile);//初始化

12 void ReadStopFile();

13 void ReadWordFile();

14 void sortWordByFrequency();

15 void printWordFrequency()const;

16 private:

17 void addWordToDict(const std::string &word);

18 bool isStopWord(const std::string &word)const;

19

20 typedef std::vectorstring, int> >::iterator Wordit;//为了⽅便

21 typedef std::vectorstring, int> >::const_iterator Wordkit;

22

23 std::string filename_;

24 std::string stopFile_;

25

26 std::vectorstring> stopList_;

27 std::vectorstring, int> > words_;

28

29 };

30

31 #endif

函数实现如下：

1 //

2 #include "WordFrequency.h"

3 #include "StringUtil.h"

4 #include

5 #include

6 #include

7 #include

8

9 using namespace std;

10 using namespace stringutil;

11

12 WordFrequency::WordFrequency(const std::string &filename, const std::string &stopFile)

13 :filename_(filename),stopFile_(stopFile)

14 { }

15

16 void WordFrequency::ReadStopFile()

17 {

18 ifstream in(stopFile_.c_str());

19 if( !in )

20 throw runtime_error("open file failure");

21 string word;

22 while(in >> word)

23 {

24 stopList_.push_back(word);

25 }

26 in.close();

27 }

28

29 void WordFrequency::ReadWordFile()

30 {

31 ifstream infile(filename_.c_str());

32 if( !infile )

33 throw runtime_error("open file failure!");

34 string word;

35 while(infile >> word)

36 {

37 erasePunct(word);

38 if( isAllDigit(word))

39 continue;

40 stringToLower(word);

41 if( !isStopWord(word))

42 addWordToDict(word);

43 }

44 ();

45 }

46

47 bool WordFrequency::isStopWord(const string &word)const

48 {

49 vector<string>::const_iterator it = stopList_.begin();

50 while( it != stopList_.end())

51 {

52 if( *it == word)

53 break;

54 it ++;

55 }

56 return (it != stopList_.end());

57 }

58

59 void WordFrequency::addWordToDict(const string &word)

60 {

61 Wordit it = words_.begin();

62 while( it != words_.end())

63 {

64 if( it->first == word)

65 {

66 ++ it->second ;

67 break;

68 }

7 #include "WordFrequency.h"

8 #include

9 #include

10 #include

11 #include

12 #include

13 #include <string.h>

14 using namespace std;

15

16 int64_t gettime();

17

18 int main(int argc, const char *argv[])

19 {

20 if(argc < 3)

21 throw runtime_error("exe,filename, stopfile");

22 int64_t starttime =gettime();

23 WordFrequency wf(argv[1], argv[2]);

24 opFile();

25 rdFile();

26

27 int64_t readtime = gettime();

28 rdByFrequency();

29

30 int64_t sorttime = gettime();

31

32 printf("readfile time: %lld msn",(readtime-starttime)/1000);

33 printf("sortfile time: %lld msn",(sorttime-readtime)/1000);

34

35 ordFrequency();

36 return 0;

37 }

38

39 int64_t gettime()

40 {

41 struct timeval tv;

42 ::memset(&tv, 0, sizeof tv);

43 if(::gettimeofday(&tv, NULL) == -1)

44 {

45 throw runtime_error("gettimeofday");

46 }

47 int64_t t = _usec;

48 t += _sec * 1000 * 1000;

49 return t;

50 }

⼆、map、set；

相较于第⼀种，这种⽅式⽤到的数据结构有

1、std::set StopList_ ⽤于存放停⽤词；

2、std::map words_ ⽤于存放The Bible Holy 中除了停⽤词之外的词；

3、std::vector< std::pair > copyWords_ 由于map不具备排序功能，故需要将其转存均vector中。

WordFrequency.h

1 #ifndef WORDFREQUENCY_H_

2 #define WORDFREQUENCY_H_

3

4 #include

5 #include

6 #include <set>

7 #include <string>

8 class WordFrequency

9 {

10 public:

11 WordFrequency(const std::string &filename, const std::string &stopfile);

12 void ReadStopFile();

13 void ReadWordFile();

14

15 void copyWordToVector();

16 void sortWordByFrequency();

17 void printFrequency()const;

18

19 private:

20 std::string filename_;

21 std::string stopfile_;

22

23 std::setstring> StopList_;

24 std::mapstring, int> words_;

25

26 std::vector< std::pairstring,int> > copyWords_;

27 };

28

29 #endif

1 //

2 #include "WordFrequency.h"

3 #include "StringUtil.h"

4 #include

5 #include

6 #include

7 #include

8 using namespace std;

9 using namespace stringutil;

10

11 WordFrequency::WordFrequency(const string &filename, const string &stopfile)

12 :filename_(filename),stopfile_(stopfile)

13 { }

14

15 void WordFrequency::ReadStopFile()

16 {

17 ifstream infile(stopfile_.c_str());

18 if( !infile )

19 throw runtime_error("open file failure");

20 string word;

21 while(infile >> word)

22 StopList_.insert(word);

23

24 ();

25 }

26

27 void WordFrequency::ReadWordFile()

28 {

29 ifstream infile(filename_.c_str());

30 if( !infile )

31 throw runtime_error("open file failure");

32 words_.clear();

33 string word;

34 while(infile >> word) //读取单词

35 {

36 erasePunct(word); //去除标点

37 stringToLower(word);//转为⼩写

38 if(isAllDigit(word))//去除数字

39 continue;

40 if( StopList_.count(word) == 0) //set中 count 标识 word这个单词是否存在

41 words_[word]++; //如果存在，int++，如果不存在，增添⾄ map中

42 //words_[word]= cnt（词频数）

43 }

44

45 ();

46 }

47

48 void WordFrequency::copyWordToVector()

49 {

50 copyWords_.clear();

51 //back_inserter(copyWords_)

52 //front_inserter

53 copy(words_.begin(),words_.end(),back_inserter(copyWords_));

54 }

55

56 bool cmp(const pair<string, int> &a, const pair<string, int> &b)

57 {

58 return > ;

59 }

60

61 void WordFrequency::sortWordByFrequency()

62 {

63 sort(copyWords_.begin(), copyWords_.end(),cmp); //排序

64 }

65

66 void WordFrequency::printFrequency() const

67 {

74 }

：

⽐上⼀程序的main函数多了⼀⾏

1 //多出的⼀⾏copyWordToVector

2 int64_t readtime = gettime();

3 rdToVector();

4 rdByFrequency();

三、哈希表（c11特性）：

程序三与程序⼆⾮常相似，只是其头⽂件WordFrequency.h不同，因此我们只给出 .h⽂件；

注意编译时，g++ *.cpp -std=c++0x

WordFrequency.h

1 //WordFrequency.h

2 #ifndef WORDFREQUENCY_H_

3 #define WORDFREQUENCY_H_

4

5 #include

6 #include

7 #include

8 #include <string>

9 class WordFrequency

10 {

11 public:

12 WordFrequency(const std::string &filename, const std::string &stopfile);

13 void ReadStopFile();

14 void ReadWordFile();

15

16 void copyWordToVector();

17 void sortWordByFrequency();

18 void printFrequency()const;

19

20 private:

21 std::string filename_;

22 std::string stopfile_;

23

24 std::unordered_setstring> StopList_;

25 std::unordered_mapstring, int> words_;

26

27 std::vector< std::pairstring,int> > copyWords_;

28 };

29

30 #endif

以上三个程序（所采⽤的数据结构不同）性能⽐较：

程序⼀：读取⽂件时间：42997ms、排序时间 7ms；

程序⼆：读取⽂件时间：1294ms、排序时间 10ms；

程序三：读取⽂件时间： 616ms、排序时间 9ms；

分析：

读取时间：

程序⼀采⽤的是vector，所以每⼀次从 txt 读取⼀个单词都要先遍历⼀遍停⽤词的vector，然后再顺序遍历⼀次 vector，以查找该单词是否

已经存在，故其平均查找时间复杂度为 O（N*N）；

程序⼆采⽤的是 map与set。map的实现是根据平衡⼆叉树（具体是红⿊树），其查找平均速度是n*lgN；⽽且如果不存在时，⾃动添加该

元素⾄map中，词频置为1。

程序三采⽤的是哈希表， unordered_map，与unordered_set ，其查找时间复杂度为常量O(1)；

排序时间：

程序⼀所⽤时间⽐程序⼆、三多了⼏毫秒，这是因为map不具备排序功能，因为要转存⾄vector中。⽽多出来的这⼏毫秒就是转存时间。

本文发布于:2023-11-29，感谢您对本站的认可！

本文链接:https://www.fzithome.com/xitong/1701268016a26741.html

C++回顾统计词频问题

发布评论取消回复

最近发表

相关推荐

标签列表

C++回顾统计词频问题

发布评论 取消回复

最近发表

相关推荐

标签列表

发布评论取消回复