- C++17 STL Cookbook
- Jacek Galowicz
- 2025-04-04 19:00:07
How to do it...
In this section, we will read all user input from standard input and tokenize it by whole sentences rather than by words. Then we will collect all sentences in an std::multimap, each keyed by its length in words. Afterward, we output all sentences, sorted by their length, back to the user.
- As usual, we need to include all needed headers. std::multimap comes from the same header as std::map.
#include <iostream>
#include <iterator>
#include <map>
#include <algorithm>
#include <string>
- We use a lot of functions from namespace std, so we declare that we use it by default.
using namespace std;
- When we tokenize strings by extracting what's between dot characters in the text, we get sentences surrounded by white space such as spaces, new line symbols, and so on. This white space would wrongly inflate their size, so we strip it with a helper function, which we now define.
string filter_ws(const string &s)
{
    const char *ws {" \r\n\t"};
    const auto a (s.find_first_not_of(ws));
    const auto b (s.find_last_not_of(ws));
    if (a == string::npos) {
        return {};
    }
    // substr takes a start position and a character count, not an end index
    return s.substr(a, b - a + 1);
}
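The index arithmetic is easy to get wrong, because the second parameter of std::string::substr is a character count rather than an end position. The following is a minimal sketch that shows the calculation in isolation. The name trim_ws is hypothetical and only used for illustration; it is not part of the recipe.

```cpp
#include <string>

// Hypothetical helper, for illustration only: returns s without
// leading and trailing white space.
std::string trim_ws(const std::string &s)
{
    const char *ws {" \r\n\t"};
    const auto first (s.find_first_not_of(ws));
    if (first == std::string::npos) {
        return {}; // empty, or nothing but white space
    }
    const auto last (s.find_last_not_of(ws));
    // substr takes (position, count); the inclusive range
    // [first, last] holds last - first + 1 characters.
    return s.substr(first, last - first + 1);
}
```

Calling trim_ws("  hello world \n") yields "hello world", and a string that consists only of white space yields an empty string.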
- The actual sentence length counting function takes one large string containing all the text and returns an std::multimap that maps sentence lengths to sentences, sorted by length.
multimap<size_t, string> get_sentence_stats(const string &content)
{
- We begin by declaring the multimap structure, which is intended to be the return value, and some iterators. As we will have a loop, we need an end iterator. Then we use two iterators in order to point to consecutive dots within the text. Everything between is a text sentence.
    multimap<size_t, string> ret;
    const auto end_it (end(content));
    auto it1 (begin(content));
    auto it2 (find(it1, end_it, '.'));
- The iterator it2 always points to the next dot at or after it1. The loop runs as long as it1 has not reached the end of the text. The second condition checks that it2 is at least one character ahead of it1; otherwise, there would be no characters to read between them. Note that this also ends the loop at two consecutive dots, such as in an ellipsis.
    while (it1 != end_it && distance(it1, it2) > 0) {
- We create a string from all characters between the iterators, and filter all white space from its beginning and end, in order to count the length of the pure sentence.
        string s {filter_ws({it1, it2})};
- The sentence may contain nothing but white space, in which case we simply drop it. Otherwise, we measure its length as the number of words, which is the number of spaces plus one. This assumes that words are separated by exactly one space each. Then we save the word count together with the sentence in the multimap.
        if (s.length() > 0) {
            const auto words (count(begin(s), end(s), ' ') + 1);
            ret.emplace(make_pair(words, move(s)));
        }
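- For the next loop iteration, we put it1 on the character right after the dot that it2 points to, and then advance it2 to the next dot. If it2 has already reached the end of the text because the last sentence has no closing dot, we leave the loop instead, as stepping past the end iterator is undefined behavior.
        if (it2 == end_it) { break; }
        it1 = next(it2, 1);
        it2 = find(it1, end_it, '.');
    }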
- After the loop terminates, the multimap contains all sentences paired with their word count and can be returned.
    return ret;
}
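The two-iterator pattern can also be tried in isolation. The following is a minimal sketch rather than the recipe's code: split_on_dot is a hypothetical name, it returns the untrimmed pieces in a std::vector, and unlike the loop above it skips empty pieces (as produced by consecutive dots) instead of stopping at them.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: returns the raw text
// pieces found between '.' characters, without trimming them.
std::vector<std::string> split_on_dot(const std::string &content)
{
    std::vector<std::string> pieces;
    const auto end_it (std::end(content));
    auto it1 (std::begin(content));

    while (it1 != end_it) {
        const auto it2 (std::find(it1, end_it, '.'));
        if (it1 != it2) {
            pieces.emplace_back(it1, it2); // characters in [it1, it2)
        }
        if (it2 == end_it) {
            break; // no further dot: do not step past the end
        }
        it1 = std::next(it2); // continue right after the dot
    }
    return pieces;
}
```

For the input "a b. c. d" this produces the three pieces "a b", " c", and " d". The leading spaces are still present, which is why the recipe trims each piece before counting its words.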
- Now we put the function to use. First, we tell std::cin not to skip white space, as we want sentences with their spaces in one piece. In order to read the whole input, we initialize an std::string from input stream iterators that wrap std::cin.
int main()
{
    cin.unsetf(ios::skipws);
    string content {istream_iterator<char>{cin}, {}};
- As we only need the multimap result for printing, we put the get_sentence_stats call directly into the header of the range-based for loop and feed it with our string. In the loop body, we print the items line by line.
    for (const auto & [word_count, sentence]
             : get_sentence_stats(content)) {
        cout << word_count << " words: " << sentence << ".\n";
    }
}
- After compiling the code, we can feed the app with text from any text file. An example Lorem Ipsum text yields the following output. Because the multimap is ordered by key, the shortest sentences are printed first and the longest last. For long texts this is convenient: terminals usually scroll to the end of the output, so the longest sentences are the ones that remain visible.
$ cat lorem_ipsum.txt | ./sentence_length
...
10 words: Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem.
10 words: Sed consequat, leo eget bibendum sodales, augue velit cursus nunc,.
12 words: Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
17 words: Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum.
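If the longest sentences should come first instead, the same multimap can be walked backwards. The following is a minimal sketch of that variation, not part of the recipe. The name longest_first is hypothetical and only used for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: returns the sentences of
// a multimap ordered from the highest word count to the lowest.
std::vector<std::string> longest_first(
    const std::multimap<std::size_t, std::string> &stats)
{
    std::vector<std::string> ret;
    // rbegin/rend traverse the sorted keys in descending order.
    for (auto it (stats.rbegin()); it != stats.rend(); ++it) {
        ret.push_back(it->second);
    }
    return ret;
}
```

The container itself stays sorted in ascending key order; only the direction of traversal changes, so no additional sorting step is needed.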