
C++ Strtk Tokenizer

By Arash Partow

A brief introduction to a tokenizer implementation in C++

Posted: 25 Jan 2008
Updated: 29 Jan 2008

Introduction

This article presents the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization, in the context of string processing, is the method by which a sequence of elements is broken up, or fragmented, into sub-sequences. The indices in the original sequence that determine such breaks are known as delimiters.

There are two types of delimiters: normal or thin delimiters, which are one element in length, and thick delimiters, which are two or more elements in length. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated over in a linear fashion can be tokenized; examples include a list of integers, a vector of person classes or a map of strings.

Another Tokenizer?

To date there have been many attempts to define a "standard" tokenizer implementation in C++. Of them all, the most likely candidate would be the implementation in the BOOST library. Proposed implementations should, to some extent, consider one or more of the following areas: over-all usage patterns, constructions, genericity of design and performance.

1. Over-all usage patterns

Here we are mainly concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which more often than not requires integration with C++ STL algorithms. A tokenizer by definition would be classed as a producer, so the question becomes how easy is it for others to consume its output? Another issue is consistency in the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? And what do preceding and trailing delimiters mean?

2. Constructions

In the context of string tokenizing, the majority of implementations return each token as a new instance of a string. This process requires a string to be created on the heap, populated with the sub-string in question from the original sequence, and then returned (some of this may be alleviated by RVO). In the case of iterators this essentially means another copy for the caller. Furthermore, two kinds of input can make this situation worse: a large sequence made up of lots of very short tokens, or a large sequence made up of lots of very large tokens. The solution is not to return the token as a string but rather as a range (BOOST split already has this feature) and to allow the caller to decide how they wish to inspect the token. This minor change in interface design provides a great deal of flexibility and performance gain.
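To make the distinction concrete, the following is a minimal sketch contrasting the two approaches (the names token_as_copy and token_range are illustrative and not part of StrTk):

#include <iostream>
#include <string>
#include <utility>

// Copy-based: every token allocates and populates a new std::string.
std::string token_as_copy(std::string::const_iterator begin,
                          std::string::const_iterator end)
{
   return std::string(begin,end); // one heap allocation per token
}

// Range-based: the token is simply a pair of iterators into the
// original sequence - no allocation, the caller decides what to do.
typedef std::pair<std::string::const_iterator,
                  std::string::const_iterator> token_range;

int main()
{
   std::string s = "abc|123";
   token_range r(s.begin(),s.begin() + 3); // denotes the token "abc"
   std::cout << std::string(r.first,r.second) << std::endl;
   return 0;
}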

3. Genericity of design

Most tokenizer implementations concern themselves only with strings, which for the most part is OK, because most things that need tokenizing are strings. However, there will be times when one has a sequence of types that needs to be tokenized but is not a string, hence a tokenizer should be designed in such a way as to enable such uses. Moreover, it becomes clear that the most basic tokenizing algorithms are invariant to the type of the delimiter - a user-defined delimiter predicate, as sketched below, is all that is required.
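As an example, a minimal sketch of such a user-defined predicate for a sequence of integers might look like the following, assuming (as the examples later in this article suggest) that a delimiter predicate is simply a function object returning bool for a given element. The even_delimiter name is purely illustrative:

// Any even integer acts as a delimiter (illustrative only).
struct even_delimiter
{
   inline bool operator()(const unsigned int& value) const
   {
      return (0 == (value % 2));
   }
};

Such a predicate could then be passed wherever the built-in predicates are used.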

4. Performance

Tokenizing of strings ranges from low frequency inputs, such as user input or the parsing of simple configuration files, to more complex situations such as the tokenizing of HTML pages that a web crawler/indexer might perform. Performance is generally important and can usually be helped along by good usage patterns that encourage reuse of instances (see the sketch below), minimal preprocessing, and user supplied predicates for the nastier areas of the process.
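The following is a minimal sketch of the instance-reuse point, constructing the predicate and token container once and reusing them for every line processed (it uses the strtk::split and multiple_char_delimiter_predicate constructs introduced below; the data is illustrative):

#include <iterator>
#include <list>
#include <string>
#include <utility>

int main()
{
   // Construct the predicate once; its internal set-up work is
   // performed a single time rather than once per line.
   strtk::multiple_char_delimiter_predicate predicate(" .;?");

   const std::string lines[] = { "abc 123;xyz", "def.456?uvw" };
   std::list< std::pair<std::string::const_iterator,
                        std::string::const_iterator> > token_list;

   for (unsigned int i = 0; i < 2; ++i)
   {
      token_list.clear(); // reuse the token container as well
      strtk::split(lines[i],predicate,std::back_inserter(token_list));
      // ... consume token_list ...
   }
   return 0;
}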

Getting started

StrTk provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.

The tokenization process can be further parameterized by switching between "compressed delimiters" and "no compressed delimiters" modes. Compression essentially means that consecutive delimiters are collapsed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input; they represent the no compressed and compressed tokenization processes respectively. The delimiter in this instance is the pipe symbol | and <> denotes an empty token.

No Compressed Delimiters

Input      Token List
a          a
a|b        a,b
a||b       a,<>,b
|a         <>,a
a|         a,<>
|a||b      <>,a,<>,b
||a||b||   <>,<>,a,<>,b,<>,<>
|          <>,<>
||         <>,<>,<>
|||        <>,<>,<>,<>

Compressed Delimiters

Input      Token List
a          a
a||b       a,b
|a||b      <>,a,b
||a||b||   <>,a,b,<>
|          <>,<>
||         <>,<>
|||        <>,<>

Delimiters

Two forms of delimiters are supported: single delimiter predicates (SDP) and multiple delimiter predicates (MDP). Essentially an SDP is one where only a single type can break or fragment the sequence, whereas with an MDP there is more than one unique type that can break the sequence. It is possible to represent an SDP using an MDP, however from a performance point of view having separate predicates is far more efficient. Additionally, for strings based on char or unsigned char (8-bit versions) there is an MDP that has a look-up complexity of O(1), making it greatly more efficient than the basic MDP.

Single delimiter predicate

Instantiation requires specialization of type and construction requires an instance of the type.

strtk::single_delimiter_predicate<typename T>(const T& t)

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');

Multiple delimiter predicate

Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.

strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);

unsigned int delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> predicate(delimiters,delimiters + 5);

Multiple char delimiter predicate

Instantiation requires an iterator range based on either unsigned char or char. This delimiter predicate is more efficient than the simple MDP as it has a predicate evaluation cost of O(1), due to the use of a lookup table, as opposed to O(n) where n is the number of delimiters. The performance increase grows as more delimiters are used. A sketch of the lookup-table technique follows the example below.

strtk::multiple_char_delimiter_predicate(const std::string& s)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin, const std::string::const_iterator end)

strtk::multiple_char_delimiter_predicate predicate(" .;?");
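The following is a minimal sketch of the lookup-table technique itself (illustrative only - not StrTk's actual internals): the table is filled once at construction, after which each evaluation is a single array access regardless of how many delimiters there are.

#include <algorithm>
#include <cstddef>
#include <string>

struct char_lookup_predicate
{
   bool table[256];

   char_lookup_predicate(const std::string& delimiters)
   {
      std::fill(table,table + 256,false);
      for (std::size_t i = 0; i < delimiters.size(); ++i)
      {
         table[static_cast<unsigned char>(delimiters[i])] = true;
      }
   }

   inline bool operator()(const char c) const
   {
      return table[static_cast<unsigned char>(c)];
   }
};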

Split

This is a function that performs tokenization over an entire sequence in one go. strtk::split takes a sequence, either through a range of iterators or, in the case of a string, through a reference to its instance, along with a delimiter predicate and an output iterator. It populates the output iterator with the tokens it extracts. The tokens in this case are std::pairs of iterators into the sequence.

std::string s = "abc|123|xyz|789";
std::list< std::pair< std::string::const_iterator,std::string::const_iterator > > token_list;

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(s,predicate,std::back_inserter(token_list));

Adding "true" as the last parameter to the function call of strtk::split will enable compressed delimiters mode, by default it is set to false.

strtk::split(s,predicate,std::back_inserter(token_list),true);

Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:

std::string s = "abc?123;xyz.789";
std::list< std::pair< std::string::const_iterator,std::string::const_iterator > > token_list;

strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(s,predicate,std::back_inserter(token_list));

The contents of the token_list can be printed out as follows:

std::list< std::pair< std::string::const_iterator,std::string::const_iterator > >::iterator it = token_list.begin();
while(it != token_list.end())
{
   std::string current_token = std::string((*it).first,(*it).second);
   std::cout << current_token << "\t";
   ++it;
}

Tokenizer

The tokenizer is specialized on a sequence iterator and a predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Type definitions exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means of accessing the tokens in the sequence.

const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate);

Adding "true" as the last parameter to the constructor of the tokenizer will enable compressed delimiters mode, by default it is set to false.

strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate,true);

Iteration over the tokens is performed as follows:

strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int > >::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
   std::copy((*it).first,(*it).second,std::ostream_iterator<unsigned int>(std::cout," "));
   std::cout << std::endl;
   ++it;
}

A typical std::string can be tokenized in the following manner:

std::string s = "abc|123|xyz|789";
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type tokenizer(s,predicate);
strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
   std::string current_token = std::string((*it).first,(*it).second);
   std::cout << current_token << "\t";
   ++it;
}

Another common situation may be tokenizing a sequence of strings, such as the following:

const unsigned int str_list_size = 10;
std::string str_list[str_list_size] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
                                        "mno" , "delimiter" , "rst" , "uvw" ,
                                        "delimiter" , "xyz" };

strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer(range.begin(),range.end(),predicate);
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> >::iterator it = tokenizer.begin();
while(it != tokenizer.end())
{
   std::copy((*it).first,(*it).second,std::ostream_iterator<std::string>(std::cout," "));
   std::cout << std::endl;
   ++it;
}

A Practical Example

Let's assume you have been given an English text file to process, with the intention of extracting a lexicon from the file.

One solution would be to break the problem down into a line-by-line tokenization problem. In this case you would define a function object such as the following, which takes the container in which you plan on storing your tokens along with a predicate, and inserts the tokens as strings into the container.

template<typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   inline void operator() (const std::string& s) const
   {
      strtk::split(s,predicate_,strtk::range_to_string_back_inserter(container_),true);
   }

private:
   Container& container_;
   const Predicate& predicate_;
};

Putting the whole thing together involves opening the file and reading it line by line, each time invoking parse_line, as follows:

template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   std::string delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
   strtk::multiple_char_delimiter_predicate predicate(delimiters);
   unsigned int line_count = strtk::for_each_line(file_name,parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main(void)
{
   std::string text_file_name = "text.txt";
   std::deque< std::string > word_list;
   parse_text(text_file_name,word_list);
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}

Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity, let's define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. The list definition might look something like the following:

const unsigned int list_size = 17;
const std::string not_of_interest_list [list_size] = { "as", "at", "but", "by", "for",
                                                       "in", "like", "next", "of", "on",
                                                       "opposite", "out", "past", "to",
                                                       "up", "via", ""};

Some minor updates to the parse_line predicate include using the filter_on_match predicate to determine whether the currently processed token is a preposition, and the invocation of the range_to_string_back_inserter to convert the tokens from their range iterator representation to a string representation compatible with the user defined container. For the new implementation to provide unique words of interest, the simplest change that can be made is to replace the deque used as the container for the word_list with some kind of 1-1 associative container such as a map or set. The following is the improved version of the parse_line predicate:

template<typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(""),
     tokenizer_(tmp_,predicate_,true)
   {}

   inline void operator() (const std::string& s)
   {
      strtk::for_each_token(s,tokenizer_,
                            strtk::filter_on_match< strtk::range_to_string_back_inserter_iterator<Container> >
                            (not_of_interest_list,not_of_interest_list + list_size,
                            strtk::range_to_string_back_inserter_iterator<Container>(container_),
                            true,false));
   }

private:
   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
};

The above described predicate can be greatly simplified by using binders and various lambda expressions, as sketched below.
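For illustration, a C++11 lambda (which post-dates this article) would allow the per-line handling to be written inline, assuming strtk::for_each_line accepts an arbitrary callable taking a std::string:

// Hedged sketch: the parse_line functor replaced by an inline lambda.
std::deque<std::string> word_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::for_each_line("text.txt",
                     [&](const std::string& s)
                     {
                        strtk::split(s,
                                     predicate,
                                     strtk::range_to_string_back_inserter(word_list),
                                     true);
                     });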

Another Example

When performing serialization or deserialization of an object instance, such as a class or struct, a simple approach one could take would be to convert each of the members into a string representation and, from those strings, construct a larger string, delimiting each member with a special character guaranteed not to exist in any of the string representations.

In this example we will assume that there is a struct that represents the properties of a person, a person struct if you will:

struct person
{
   std::string name;
   unsigned int age;
   double height;
   double weight;
};

The process of populating a person struct entails having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.

Person Tuple Format

Token0   Delimiter0   Token1   Delimiter1   Token2      Delimiter2   Token3
Name     |            Age      |            Height(m)   |            Weight(kg)
std::string data = "Rumpelstiltskin|97|1.31|58.7";
person p;

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type tokenizer(data,predicate);

strtk::parse(tokenizer,p.name,p.age,p.height,p.weight);

Batch processing of a text file comprised of one person tuple per line is somewhat similar to the previous example. A function object is defined that takes a container specialized on the struct and a predicate, from which it instantiates a tokenizer. This function object is then instantiated and fed to the strtk::for_each_line processor.

template<typename Container, typename Predicate>
struct parse_person_tuple
{
public:
   parse_person_tuple(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(""),
     tokenizer_(tmp_,predicate_,true)
   {}

   inline void operator() (const std::string& s)
   {
      person p;
      strtk::parse(s,tokenizer_,p.name,p.age,p.height,p.weight);
      container_.push_back(p);
   }

private:
   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
};

Bringing the above pieces together to process a file:

int main(void)
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   typedef strtk::single_delimiter_predicate<std::string::value_type> sdp_predicate_type;
   typedef parse_person_tuple<std::deque<person>,sdp_predicate_type> predicate_type;
   strtk::for_each_line(text_file_name,predicate_type(person_list,sdp_predicate_type('|')));
   return 0;
}

Strtk Library Dependency

Originally the library was mostly compatible with C++ compilers ranging from GCC 2.95 to Visual Studio 6.0, however due to upgrades it now requires at least GCC 3.4 or Visual Studio 8.0 to compile easily. It also requires the BOOST library for its boost::lexical_cast routine. This dependency can be removed manually, for example with a simple stringstream-based replacement such as the sketch below.
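A minimal sketch of such a replacement (illustrative, with error handling omitted for brevity) covering the common string-to-value direction:

#include <sstream>
#include <string>

// A simplified substitute for boost::lexical_cast<T>(s): stream the
// string into a value of type T (no failure reporting, for brevity).
template<typename T>
inline T lexical_cast(const std::string& s)
{
   std::stringstream ss(s);
   T result;
   ss >> result;
   return result;
}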

Conclusion

Like most things, there is a trade-off between performance and usability with the above mentioned tokenizer. The original aim was to provide an interface similar to that of the BOOST tokenizer, but also to provide some simplifications that will hopefully give the user more flexibility in the long run. This is by no means a complete replacement, but rather a simpler way of looking at the problem. Tokenizing a string isn't the most interesting problem one could tackle, but it does have its interesting points when one has a few TB of data to process - doing it properly could mean the difference between finishing the job this week or next month.

License

This article is licensed under Common Public License Version 1.0 (CPL)

About the Author

Arash Partow


I'm a computer scientist by ambition and a software engineer by trade.

