How to do it...

In this section, we will read all user input from standard input, which might, for example, be a text file containing an essay. We tokenize the input into words in order to count how often each word occurs.

  1. As always, we need to include all the headers for the data structures we are going to use.
      #include <iostream>
      #include <map>
      #include <vector>
      #include <algorithm>
      #include <iomanip>
      #include <string>
      #include <utility>
  1. To spare us some typing, we declare that we use the std namespace.
      using namespace std;
  1. We will use one helper function that crops leading and trailing punctuation characters, such as commas, dots, colons, or semicolons, from words.
      string filter_punctuation(const string &s)
      {
          const char *forbidden {".,:; "};
          const auto idx_start (s.find_first_not_of(forbidden));

          // If the token consists of punctuation only, return an empty string
          if (idx_start == string::npos) { return {}; }

          const auto idx_end (s.find_last_not_of(forbidden));
          return s.substr(idx_start, idx_end - idx_start + 1);
      }
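      As a quick sanity check, the helper can be exercised in a tiny separate test program. This is just a minimal sketch with made-up example tokens, not part of the recipe:

      // Standalone test sketch for filter_punctuation (example tokens are made up)
      #include <cassert>
      #include <string>
      using namespace std;

      // ... filter_punctuation as defined above ...

      int main()
      {
          assert(filter_punctuation("lorem,") == "lorem");
          assert(filter_punctuation(":dolor.") == "dolor");
          assert(filter_punctuation("...").empty()); // punctuation only
      }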
  1. Now we start with the actual program. We will collect a map that associates every word we see with a counter of that word's frequency. Additionally, we maintain a variable that records the length of the longest word we have seen so far, so we can indent the word frequency table nicely when we print it at the end of the program.
      int main()
      {
          map<string, size_t> words;
          int max_word_len {0};
  1. When we stream from std::cin into a std::string variable, the stream skips leading whitespace and stops reading at the next whitespace character. This way, we get the input word by word.
          string s;
          while (cin >> s) {
  1. The word we have now could contain commas, dots, or colons because it might stand at the end of a sentence or similar. We filter those characters out with the helper function we defined before, and we skip tokens that consist of punctuation only.
              auto filtered (filter_punctuation(s));
              if (filtered.empty()) { continue; }
  1. In case this word is the longest one so far, we need to update the max_word_len variable. The explicit <int> template parameter is needed because std::max requires both arguments to be of the same type, while string::length() returns an unsigned type.
              max_word_len = max<int>(max_word_len, filtered.length());
  1. Now we will increment the counter value of the word in our words map. If the word occurs for the first time, its counter is implicitly created and initialized to zero before we increment it (see the sketch after the loop).
              ++words[filtered];
          }
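      The implicit creation works because std::map's operator[] value-initializes the mapped size_t to 0 when the key is not present yet. A minimal standalone sketch of this behavior:

      #include <cassert>
      #include <map>
      #include <string>

      int main()
      {
          std::map<std::string, size_t> m;

          ++m["new_word"];             // key not present: created as 0, then incremented
          assert(m["new_word"] == 1);

          // Note that a mere lookup via operator[] also inserts the key:
          assert(m["unseen"] == 0);
          assert(m.size() == 2);
      }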
  1. After the loop terminates, we know that we have saved all the unique words from the input stream in the words map, paired with a counter denoting each word's frequency. The map uses the words as keys and keeps them in alphabetical order. What we want is to print all words sorted by their frequency, so that the words with the highest frequency come first. In order to get that, we first instantiate a vector into which all these word-frequency pairs fit, and move them from the map to the vector.
          vector<pair<string, size_t>> word_counts;
          word_counts.reserve(words.size());
          move(begin(words), end(words), back_inserter(word_counts));
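      A side note: since the map's keys are of type const string, the std::move algorithm above can only copy the key strings, although it does move the counter values. If we really wanted to move the strings out of the map, C++17 node extraction would be one way to do it. This is just a sketch and not part of the recipe; it empties the words map in the process:

          // Alternative sketch: steal the strings from the map via node extraction
          while (!words.empty()) {
              auto node (words.extract(begin(words)));
              word_counts.emplace_back(move(node.key()), node.mapped());
          }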
  1. The vector still contains all the word-frequency pairs in the same order in which the words map maintained them. Now we sort it in order to have the most frequent words at the beginning and the least frequent ones at the end.
          sort(begin(word_counts), end(word_counts),
               [](const auto &a, const auto &b) {
                   return a.second > b.second;
               });
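      One design note: std::sort does not guarantee that words with equal counts keep their relative order, so ties may appear in arbitrary order. If we want ties to remain in the alphabetical order the map already established, std::stable_sort is a drop-in alternative:

          stable_sort(begin(word_counts), end(word_counts),
                      [](const auto &a, const auto &b) {
                          return a.second > b.second;
                      });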
  1. All the data is in order now, so we print it to the user's terminal. Using the std::setw stream manipulator, we format the data with nice indentation so that it looks somewhat like a table.
          cout << "# " << setw(max_word_len) << "<WORD>" << " #<COUNT>\n";
          for (const auto & [word, count] : word_counts) {
              cout << setw(max_word_len + 2) << word << " #"
                   << count << '\n';
          }
      }
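      The program uses structured bindings in the output loop, so it needs to be compiled in C++17 mode. Assuming the source file is called word_frequency_counter.cpp (the file name is just an assumption), a typical compiler invocation could look like this:

      $ g++ -std=c++17 -o word_frequency_counter word_frequency_counter.cpp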
  1. After compiling the program, we can pipe any text file into it in order to get a frequency table.
      $ cat lorem_ipsum.txt | ./word_frequency_counter
      # <WORD> #<COUNT>
      et #574
      dolor #302
      sed #273
      diam #273
      sit #259
      ipsum #259
      ...