
C++ Strtk Tokenizer

By Arash Partow

A brief introduction to a tokenizer implementation in C++

Posted: 25 Jan 2008
Updated: 29 Jan 2008

Introduction

This article presents the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization, in the context of string processing, is the method by which a sequence of elements is broken up, or fragmented, into sub-sequences. The indices in the original sequence that determine such breaks are known as delimiters.

There are two types of delimiters: normal or thin delimiters, which are one element in length, and thick delimiters, which are two or more elements in length. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated over in a linear fashion can be tokenized; examples include a list of integers, a vector of person classes or a map of strings.

Another Tokenizer?

To date there have been many attempts to define a "standard" tokenizer implementation in C++. Of them all, the most likely candidate would be the implementation in the BOOST library. Proposed implementations should, to some extent, consider one or more of the following areas: over-all usage patterns, constructions, genericity of design and performance.

1. Over-all usage patterns

Here we are mainly concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which more often than not requires integration with C++ STL algorithms. A tokenizer by definition would be classed as a producer, so the question becomes how easy is it for others to consume its output? Another issue is consistency in the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? And what do preceding and trailing delimiters mean?

2. Constructions

In the context of string tokenizing, the majority of implementations return each token as a new instance of a string. This process requires a string to be created on the heap, populated with the sub-string in question from the original sequence, and then returned (some of this may be alleviated by RVO). In the case of iterators this essentially means another copy for the caller. Furthermore, two kinds of input can make this situation worse: a large sequence made up of lots of very short tokens, or a large sequence made up of lots of very large tokens. The solution is not to return the token as a string but rather as a range (BOOST split already has this feature) and to allow the caller to decide how they wish to inspect the token. This minor change in interface design provides a great deal of flexibility and performance gain.
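To make the distinction concrete, the following is a minimal sketch contrasting the two approaches (the names token_as_copy and token_range are illustrative and not part of StrTk):

#include <iostream>
#include <string>
#include <utility>

// Copy-based: every token allocates and populates a new std::string.
std::string token_as_copy(std::string::const_iterator begin,
                          std::string::const_iterator end)
{
   return std::string(begin,end); // one heap allocation per token
}

// Range-based: the token is simply a pair of iterators into the
// original sequence - no allocation, the caller decides what to do.
typedef std::pair<std::string::const_iterator,
                  std::string::const_iterator> token_range;

int main()
{
   std::string s = "abc|123";
   token_range r(s.begin(),s.begin() + 3); // denotes the token "abc"
   std::cout << std::string(r.first,r.second) << std::endl;
   return 0;
}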

3. Genericity of design

Most tokenizer implementations concern themselves only with strings, which for the most part is OK, because most things that need tokenizing are strings. However, there will be times when one has a sequence of types that needs to be tokenized but is not a string, hence a tokenizer should be designed in such a way as to enable such uses. Moreover, it becomes clear that the most basic tokenizing algorithms are invariant to the type of the delimiter - a user-defined delimiter predicate, as sketched below, is all that is required.
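As an example, a minimal sketch of such a user-defined predicate for a sequence of integers might look like the following, assuming (as the examples later in this article suggest) that a delimiter predicate is simply a function object returning bool for a given element. The even_delimiter name is purely illustrative:

// Any even integer acts as a delimiter (illustrative only).
struct even_delimiter
{
   inline bool operator()(const unsigned int& value) const
   {
      return (0 == (value % 2));
   }
};

Such a predicate could then be passed wherever the built-in predicates are used.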

4. Performance

Tokenizing of strings ranges from low frequency inputs, such as user input or the parsing of simple configuration files, to more complex situations such as the tokenizing of HTML pages that a web crawler/indexer might perform. Performance is generally important and can usually be helped along by good usage patterns that encourage reuse of instances (see the sketch below), minimal preprocessing, and user supplied predicates for the nastier areas of the process.
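The following is a minimal sketch of the instance-reuse point, constructing the predicate and token container once and reusing them for every line processed (it uses the strtk::split and multiple_char_delimiter_predicate constructs introduced below; the data is illustrative):

#include <iterator>
#include <list>
#include <string>
#include <utility>

int main()
{
   // Construct the predicate once; its internal set-up work is
   // performed a single time rather than once per line.
   strtk::multiple_char_delimiter_predicate predicate(" .;?");

   const std::string lines[] = { "abc 123;xyz", "def.456?uvw" };
   std::list< std::pair<std::string::const_iterator,
                        std::string::const_iterator> > token_list;

   for (unsigned int i = 0; i < 2; ++i)
   {
      token_list.clear(); // reuse the token container as well
      strtk::split(lines[i],predicate,std::back_inserter(token_list));
      // ... consume token_list ...
   }
   return 0;
}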

Getting started

StrTk provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.

The tokenization process can be further parameterized by switching between "compressed delimiters" and "no compressed delimiters" modes. Compression essentially means that consecutive delimiters are collapsed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input; they represent the no compressed and compressed tokenization processes respectively. The delimiter in this instance is the pipe symbol | and <> denotes an empty token.

No Compressed Delimiters

Input      Token List
a          a
a|b        a,b
a||b       a,<>,b
|a         <>,a
a|         a,<>
|a||b      <>,a,<>,b
||a||b||   <>,<>,a,<>,b,<>,<>
|          <>,<>
||         <>,<>,<>
|||        <>,<>,<>,<>

Compressed Delimiters

Input      Token List
a          a
a||b       a,b
|a||b      <>,a,b
||a||b||   <>,a,b,<>
|          <>,<>
||         <>,<>
|||        <>,<>

Delimiters

Two forms of delimiters are supported: single delimiter predicates (SDP) and multiple delimiter predicates (MDP). Essentially an SDP is one where only a single type can break or fragment the sequence, whereas with an MDP there is more than one unique type that can break the sequence. It is possible to represent an SDP using an MDP, however from a performance point of view having separate predicates is far more efficient. Additionally, for strings based on char or unsigned char (8-bit versions) there is an MDP that has a look-up complexity of O(1), making it greatly more efficient than the basic MDP.

Single delimiter predicate

Instantiation requires specialization of type and construction requires an instance of the type.

strtk::single_delimiter_predicate<typename T>(const T& t)

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');

Multiple delimiter predicate

Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.

strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);

unsigned int delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> predicate(delimiters,delimiters + 5);

Multiple char delimiter predicate

Instantiation requires an iterator range based on either unsigned char or char. This delimiter predicate is more efficient than the simple MDP as it has a predicate evaluation cost of O(1), due to the use of a lookup table, as opposed to O(n) where n is the number of delimiters. The performance increase grows as more delimiters are used. A sketch of the lookup-table technique follows the example below.

strtk::multiple_char_delimiter_predicate(const std::string& s)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin, const std::string::const_iterator end)

strtk::multiple_char_delimiter_predicate predicate(" .;?");
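The following is a minimal sketch of the lookup-table technique itself (illustrative only - not StrTk's actual internals): the table is filled once at construction, after which each evaluation is a single array access regardless of how many delimiters there are.

#include <algorithm>
#include <cstddef>
#include <string>

struct char_lookup_predicate
{
   bool table[256];

   char_lookup_predicate(const std::string& delimiters)
   {
      std::fill(table,table + 256,false);
      for (std::size_t i = 0; i < delimiters.size(); ++i)
      {
         table[static_cast<unsigned char>(delimiters[i])] = true;
      }
   }

   inline bool operator()(const char c) const
   {
      return table[static_cast<unsigned char>(c)];
   }
};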

Split

This is a function that performs tokenization over an entire sequence in one go. strtk::split takes a sequence, either through a range of iterators or, in the case of a string, through a reference to its instance, along with a delimiter predicate and an output iterator. It populates the output iterator with the tokens it extracts. The tokens in this case are std::pairs of iterators into the sequence.

std::string s = "abc|123|xyz|789";
std::list< std::pair< std::string::const_iterator,std::string::const_iterator > > token_list;

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(s,predicate,std::back_inserter(token_list));

Adding "true" as the last parameter to the function call of strtk::split will enable compressed delimiters mode, by default it is set to false.

strtk::split(s,predicate,std::back_inserter(token_list),true);

Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:

std::string s = "abc?123;xyz.789";
std::list< std::pair< std::string::const_iterator,std::string::const_iterator > > token_list;

strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(s,predicate,std::back_inserter(token_list));

The contents of the token_list can be printed out as follows:

std::list< std::pair< std::string::const_iterator,std::string::const_iterator > >::iterator it = token_list.begin();
while(it != token_list.end())
{
   std::string current_token = std::string((*it).first,(*it).second);
   std::cout << current_token << "\t";
   ++it;
}

Tokenizer

The tokenizer is specialized on a sequence iterator and a predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Type definitions exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means of accessing the tokens in the sequence.

const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate);

Adding "true" as the last parameter to the constructor of the tokenizer will enable compressed delimiters mode, by default it is set to false.

strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate,true);

Iteration over the tokens is performed as follows:

strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int > >::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
   std::copy((*it).first,(*it).second,std::ostream_iterator<unsigned int>(std::cout," "));
   std::cout << std::endl;
   ++it;
}

A typical std::string can be tokenized in the following manner:

std::string s = "abc|123|xyz|789";
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type tokenizer(s,predicate);
strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
   std::string current_token = std::string((*it).first,(*it).second);
   std::cout << current_token << "\t";
   ++it;
}

Another common situation may be tokenizing a sequence of strings, such as the following:

const unsigned int str_list_size = 10;
std::string str_list[str_list_size] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
                                        "mno" , "delimiter" , "rst" , "uvw" ,
                                        "delimiter" , "xyz" };

strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer(range.begin(),range.end(),predicate);
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> >::iterator it = tokenizer.begin();
while(it != tokenizer.end())
{
   std::copy((*it).first,(*it).second,std::ostream_iterator<std::string>(std::cout," "));
   std::cout << std::endl;
   ++it;
}

A Practical Example

Let's assume you have been given an English text file to process, with the intention of extracting a lexicon from the file.

One solution would be to break the problem down into a line-by-line tokenization problem. In this case you would define a function object such as the following, which takes the container in which you plan on storing your tokens along with a predicate, and inserts the tokens as strings into the container.

template<typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   inline void operator() (const std::string& s) const
   {
      strtk::split(s,predicate_,strtk::range_to_string_back_inserter(container_),true);
   }

private:
   Container& container_;
   const Predicate& predicate_;
};

Putting the whole thing together involves opening the file and reading it line by line, each time invoking parse_line, as follows:

template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   std::string delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
   strtk::multiple_char_delimiter_predicate predicate(delimiters);
   unsigned int line_count = strtk::for_each_line(file_name,parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main(void)
{
   std::string text_file_name = "text.txt";
   std::deque< std::string > word_list;
   parse_text(text_file_name,word_list);
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}

Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity, let's define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. The list definition might look something like the following:

const unsigned int list_size = 17;
const std::string not_of_interest_list [list_size] = { "as", "at", "but", "by", "for",
                                                       "in", "like", "next", "of", "on",
                                                       "opposite", "out", "past", "to",
                                                       "up", "via", ""};

Some minor updates to the parse_line predicate include using the filter_on_match predicate to determine whether the currently processed token is a preposition, and the invocation of the range_to_string_back_inserter to convert the tokens from their range iterator representation to a string representation compatible with the user defined container. For the new implementation to provide unique words of interest, the simplest change that can be made is to replace the deque used as the container for the word_list with some kind of 1-1 associative container such as a map or set. The following is the improved version of the parse_line predicate:

template<typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(""),
     tokenizer_(tmp_,predicate_,true)
   {}

   inline void operator() (const std::string& s)
   {
      strtk::for_each_token(s,tokenizer_,
                            strtk::filter_on_match< strtk::range_to_string_back_inserter_iterator<Container> >
                            (not_of_interest_list,not_of_interest_list + list_size,
                            strtk::range_to_string_back_inserter_iterator<Container>(container_),
                            true,false));
   }

private:
   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
};

The above described predicate can be greatly simplified by using binders and various lambda expressions, as sketched below.
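For illustration, a C++11 lambda (which post-dates this article) would allow the per-line handling to be written inline, assuming strtk::for_each_line accepts an arbitrary callable taking a std::string:

// Hedged sketch: the parse_line functor replaced by an inline lambda.
std::deque<std::string> word_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::for_each_line("text.txt",
                     [&](const std::string& s)
                     {
                        strtk::split(s,
                                     predicate,
                                     strtk::range_to_string_back_inserter(word_list),
                                     true);
                     });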

Another Example

When performing serialization or deserialization of an object instance, such as a class or struct, a simple approach one could take would be to convert each of the members into a string representation and, from those strings, construct a larger string, delimiting each member with a special character guaranteed not to exist in any of the string representations.

In this example we will assume that there is a struct that represents the properties of a person, a person struct if you will:

struct person
{
   std::string name;
   unsigned int age;
   double height;
   double weight;
};

The process of populating a person struct entails having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.

Person Tuple Format

Token0   Delimiter0   Token1   Delimiter1   Token2      Delimiter2   Token3
Name     |            Age      |            Height(m)   |            Weight(kg)
std::string data = "Rumpelstiltskin|97|1.31|58.7";
person p;

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type tokenizer(data,predicate);

strtk::parse(tokenizer,p.name,p.age,p.height,p.weight);

Batch processing of a text file comprised of one person tuple per line is somewhat similar to the previous example. A function object is defined that takes a container specialized on the struct and a predicate, from which it instantiates a tokenizer. This function object is then instantiated and fed to the strtk::for_each_line processor.

template<typename Container, typename Predicate>
struct parse_person_tuple
{
public:
   parse_person_tuple(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(""),
     tokenizer_(tmp_,predicate_,true)
   {}

   inline void operator() (const std::string& s)
   {
      person p;
      strtk::parse(s,tokenizer_,p.name,p.age,p.height,p.weight);
      container_.push_back(p);
   }

private:
   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
};

Bringing the above pieces together to process a file:

int main(void)
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   typedef strtk::single_delimiter_predicate<std::string::value_type> sdp_predicate_type;
   typedef parse_person_tuple<std::deque<person>,sdp_predicate_type> predicate_type;
   strtk::for_each_line(text_file_name,predicate_type(person_list,sdp_predicate_type('|')));
   return 0;
}

Strtk Library Dependency

Originally the library was mostly compatible with C++ compilers ranging from GCC 2.95 to Visual Studio 6.0, however due to upgrades it now requires at least GCC 3.4 or Visual Studio 8.0 to compile easily. It also requires the BOOST library for its boost::lexical_cast routine. This dependency can be removed manually, for example with a simple stringstream-based replacement such as the sketch below.
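A minimal sketch of such a replacement (illustrative, with error handling omitted for brevity) covering the common string-to-value direction:

#include <sstream>
#include <string>

// A simplified substitute for boost::lexical_cast<T>(s): stream the
// string into a value of type T (no failure reporting, for brevity).
template<typename T>
inline T lexical_cast(const std::string& s)
{
   std::stringstream ss(s);
   T result;
   ss >> result;
   return result;
}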

Conclusion

Like most things, there is a trade-off between performance and usability with the above mentioned tokenizer. The original aim was to provide an interface similar to that of the BOOST tokenizer, but also to provide some simplifications that will hopefully give the user more flexibility in the long run. This is by no means a complete replacement, but rather a simpler way of looking at the problem. Tokenizing a string isn't the most interesting problem one could tackle, but it does have its interesting points when one has a few TB of data to process - doing it properly could mean the difference between finishing the job this week or next month.

License

This article is licensed under Common Public License Version 1.0 (CPL)

About the Author

Arash Partow


I'm a computer scientist by ambition and a software engineer by trade.

