niXforums Forum Index
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups 
 ProfileProfile   PreferencesPreferences   Log in to check your private messagesLog in to check your private messages   Log inLog in 
· nixdoc.net · man pages · Linux HOWTOs · FreeBSD Tips · Forums
navigation Forum index » Programming » C++
Tokenizer Function (plus rant on strtok documentation)
Post new topic   Reply to topic Page 1 of 2 [19 Posts] View previous topic :: View next topic
Goto page:  1, 2 Next
Author Message
Roland Pibinger
*nix forums Guru Wannabe


Joined: 31 Mar 2005
Posts: 262

PostPosted: Thu Jul 13, 2006 5:22 pm    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

On 12 Jul 2006 22:35:19 -0700, "Mehturt@gmail.com" <Mehturt@gmail.com>
wrote:
Quote:
I have also found this impl is not usable with set, you only mentioned
it does not work with map. Is that intentional?

Let me add background information to the above implementation. The
performance characteristics for copying a std::string (the
std::basic_string template) are not specified by the C++ Standard.
Copying may be expensive ('deep') or cheap because of an optimization
(ref-counting or small-string optimization) - implementation specific.
The above code tries to avoid copying (and assignment) of non-empty
strings (assuming that copying an empty string is always cheap for
reasonable basic_sting implementations). As shown in the mentioned
thread with John Potter's original code an optimized version of the
tokenize function can considerably reduce dynamic memory allocation
for some string implementations and some usage scenarios.
I realize now that it's probably not a good idea to make the container
type a template parameter.That kind of optimization cannot be applied
to all Standard containers in the same way. Elements in a set e.g. are
sorted after insert and immutable (otherwise the sort order would be
compromised). Separate non-template implementations for each container
and each string type are preferable.

Best regards,
Roland Pibinger
Back to top
Mehturt@gmail.com
*nix forums beginner


Joined: 11 Oct 2005
Posts: 11

PostPosted: Thu Jul 13, 2006 5:35 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

Roland Pibinger wrote:
Quote:
On 11 Jul 2006 22:58:29 -0700, "Mehturt@gmail.com" <Mehturt@gmail.com
wrote:
Roland Pibinger wrote:
The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/aa4daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm

template <typename StringT, typename ContainerT
size_t tokenize (const StringT& text, const StringT& delim,
ContainerT& result) {
size_t num = 0;
typename StringT::size_type b = text.find_first_not_of(delim);
while (b != StringT::npos) {
typename StringT::size_type e(text.find_first_of(delim, b));

I think e can be StringT::npos here and that would cause problems in
the line below..

StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}
return num;
}

You are right. The while loop should rather be:

while (b != StringT::npos) {
typename StringT::size_type
e(std::min (text.find_first_of(delim, b), text.size()));
StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}

Thank you,
Roland Pibinger

I have also found this impl is not usable with set, you only mentioned
it does not work with map..
Is that intentional?
Back to top
Roland Pibinger
*nix forums Guru Wannabe


Joined: 31 Mar 2005
Posts: 262

PostPosted: Wed Jul 12, 2006 7:08 pm    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

On 11 Jul 2006 22:58:29 -0700, "Mehturt@gmail.com" <Mehturt@gmail.com>
wrote:
Quote:
Roland Pibinger wrote:
The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/aa4daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm

template <typename StringT, typename ContainerT
size_t tokenize (const StringT& text, const StringT& delim,
ContainerT& result) {
size_t num = 0;
typename StringT::size_type b = text.find_first_not_of(delim);
while (b != StringT::npos) {
typename StringT::size_type e(text.find_first_of(delim, b));

I think e can be StringT::npos here and that would cause problems in
the line below..

StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}
return num;
}

You are right. The while loop should rather be:

while (b != StringT::npos) {
typename StringT::size_type
e(std::min (text.find_first_of(delim, b), text.size()));
StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}

Thank you,
Roland Pibinger
Back to top
Ron Natalie
*nix forums Guru


Joined: 21 Feb 2005
Posts: 461

PostPosted: Wed Jul 12, 2006 11:44 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

jmoy wrote:
Quote:

strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.

Strtok should have never been allowed to enter the standard. It is

an abysmal function an a relic from the days when C programmers
were pretty lousy software engineers (it's even worse than the
stupid-assed "portable" I/O library that should have been allowed
to be made into the stdio.
Back to top
Mehturt@gmail.com
*nix forums beginner


Joined: 11 Oct 2005
Posts: 11

PostPosted: Wed Jul 12, 2006 5:58 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

Roland Pibinger wrote:
Quote:
On Tue, 11 Jul 2006 07:06:16 GMT, "Robbie Hatley"
bogus.address@no.spam> wrote:
So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/aa4daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm

template <typename StringT, typename ContainerT
size_t tokenize (const StringT& text, const StringT& delim,
ContainerT& result) {
size_t num = 0;
typename StringT::size_type b = text.find_first_not_of(delim);
while (b != StringT::npos) {
typename StringT::size_type e(text.find_first_of(delim, b));

I think e can be StringT::npos here and that would cause problems in
the line below..

Quote:
StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}
return num;
}

For std::vector as result container efficency can be increased with
reserve().

Best wishes,
Roland Pibinger
Back to top
Roland Pibinger
*nix forums Guru Wannabe


Joined: 31 Mar 2005
Posts: 262

PostPosted: Tue Jul 11, 2006 6:56 pm    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

On Tue, 11 Jul 2006 07:06:16 GMT, "Robbie Hatley"
<bogus.address@no.spam> wrote:
Quote:
So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/aa4daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm>

template <typename StringT, typename ContainerT>
size_t tokenize (const StringT& text, const StringT& delim,
ContainerT& result) {
size_t num = 0;
typename StringT::size_type b = text.find_first_not_of(delim);
while (b != StringT::npos) {
typename StringT::size_type e(text.find_first_of(delim, b));
StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}
return num;
}

For std::vector as result container efficency can be increased with
reserve().

Best wishes,
Roland Pibinger
Back to top
roberts.noah@gmail.com
*nix forums Guru


Joined: 11 Sep 2005
Posts: 795

PostPosted: Tue Jul 11, 2006 4:27 pm    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

Robbie Hatley wrote:
Quote:
A couple of days ago I dedecided to force myself to really learn
exactly what "strtok" does, and how to use it.

Every once in a while I also have this massocistic urge to torture
myself for no reason. Eventually though I tire of it and move on.

The stroke function is useless. It is insecure, works in insecure
types, and is a general pita to use. There are better options that are
much easier to use and have type safety. Look into string streams as a
much better alternative to stroke.
Back to top
Jerry Coffin
*nix forums Guru


Joined: 21 Feb 2005
Posts: 406

PostPosted: Tue Jul 11, 2006 3:30 pm    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

In article <1152622377.108600.157680@m79g2000cwm.googlegroups.com>,
jmoy.matecon@gmail.com says...

[ ... ]

Quote:
template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz end=0;
for(;Wink{
Sz begin=str.find_first_not_of(delim,end);
if (begin==string::npos)
break;
end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
}
}

I think I'd also make the character type a template parameter:

template class<charT, class OIter>
void tokenize( basic_string<charT> input,
basic_string<charT> delim,
OIter oi)
{
typedef basic_string<charT> str;
typedef str::size_type Sz;

Sz end = 0;
for (;Wink {
Sz begin = input.find_first_not_of(delim, end);
if (str::npos == begin)
break;
end = input.find_first_of(delim, begin);
*oi++ = input.substr(begin, end-begin);
}
}

This way, the code can work with strings of either narrow or wide
characters.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Back to top
jmoy.matecon@gmail.com
*nix forums beginner


Joined: 30 Jun 2006
Posts: 5

PostPosted: Tue Jul 11, 2006 12:52 pm    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

Robbie Hatley wrote:
Quote:
"jmoy" <jmoy.matecon@gmail.com> wrote:


#include <string
using namespace std;

template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{

typedef string::size_type Sz;

Sz begin=0;
while(begin<str.size()){
Sz end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
begin=str.find_first_not_of(delim,end);
}
}

...
Alluring in its simplicity, yes. But has two major bugs:

1. Memory corruption danger if used to write to a small container.

No. As mentioned by other posters, if you are adding tokens to a
container the right thing is to call the function with something like a
back_inserter in which case there is no memory corruption

Quote:
2. You don't take into account the fact that the string might START
with one or more delimiters.

You are right. My mistake.

Quote:

Maybe something like THIS might be better:

#include <string
// using namespace std; // Ewww.
template <class Container
void
tokenize
(
const std::string & str,
const std::string & delim,
Container & C
)
{
typedef std::string::size_type Sz;
Sz begin = 0;
Sz end = 0;
while (begin < str.size())
{
begin = str.find_first_not_of (delim, begin);
end = str.find_first_of (delim, begin);
Container.push_back(str.substr(begin, end-begin));
}
}

The problem with this is that it fails for the reverse case of a string
ending with a delimiter. Also, I don't like the idea of tying
algorithms with containers---iterators are a much more general concept.
Here is a corrected version of my function:

template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz end=0;
for(;Wink{
Sz begin=str.find_first_not_of(delim,end);
if (begin==string::npos)
break;
end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
}
}
Back to top
Alex Vinokur
*nix forums Guru Wannabe


Joined: 23 Feb 2005
Posts: 160

PostPosted: Tue Jul 11, 2006 8:51 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

Alex Vinokur wrote:
[slip]
Quote:
Also "Splitting string into vector of vectors":
http://groups.google.com/group/sources/msg/77993fb8841382c8
http://groups.google.com/group/perfo/msg/9d49a1be3a5c6335
--------------------------------------------------

Instead of
Quote:
http://groups.google.com/group/perfo/msg/8273f4d1a05cfbd1
should be

http://groups.google.com/group/perfo/msg/f3c775cf7e3cdcf0
Sorry
--------------------------------------------------
[snip]

Alex Vinokur
email: alex DOT vinokur AT gmail DOT com
http://mathforum.org/library/view/10978.html
http://sourceforge.net/users/alexvn
Back to top
Alex Vinokur
*nix forums Guru Wannabe


Joined: 23 Feb 2005
Posts: 160

PostPosted: Tue Jul 11, 2006 8:32 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

[snip]
sonison.james@gmail.com wrote:
Quote:
check out http://www.boost.org/libs/tokenizer/index.html
[snip]


Also "Splitting string into vector of vectors":
http://groups.google.com/group/sources/msg/77993fb8841382c8
http://groups.google.com/group/perfo/msg/9d49a1be3a5c6335
http://groups.google.com/group/perfo/msg/8273f4d1a05cfbd1

Alex Vinokur
email: alex DOT vinokur AT gmail DOT com
http://mathforum.org/library/view/10978.html
http://sourceforge.net/users/alexvn
Back to top
Jerry Coffin
*nix forums Guru


Joined: 21 Feb 2005
Posts: 406

PostPosted: Tue Jul 11, 2006 7:24 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

In article <I5Isg.129443$dW3.1302@newssvr21.news.prodigy.com>,
bogus.address@no.spam says...
Quote:
"Jerry Coffin" <jcoffin@taeus.com> wrote:

IMO, this is a poor idea. Take an iterator for the output.

Puts extreme burden on the user to provide the right kind of
container and iterator.

IMO, it's not extreme at all. They're going to have to provide the
right kind of container in any case -- but the code you provided will
often _prevent_ them from using the right container. Just for
example, putting the output into a set might well make sense -- but
your code simply won't work with it at all.

[ ... ]

Quote:
I can see use for both, actually. But the iterator version will
always be the more dangerous one.

The iterator version is the only one that really works. In any case,
for a programmer to become at all proficient in using C++, they need
to learn how to do this anyway -- look through most of the algorithms
in the standard library, and note that they also take an iterator to
tell them where to put the results -- with precisely the same result.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Back to top
Robbie Hatley
*nix forums Guru Wannabe


Joined: 12 Jun 2005
Posts: 174

PostPosted: Tue Jul 11, 2006 7:06 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

"Jerry Coffin" <jcoffin@taeus.com> wrote:

Quote:
IMO, this is a poor idea. Take an iterator for the output.

Puts extreme burden on the user to provide the right kind of
container and iterator. Such a function would often get mis-used
and cause memory corruption and program crashes.

Quote:
If the user wants the data pushed onto the back, they can
use back_inserter to get that.

If they know any better.

Quote:
If they want it inserted into something like a set, they
can use inserter to get that.

If they know that they should, and if they know how.

So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

I can see use for both, actually. But the iterator version will
always be the more dangerous one.

--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)
http://home.pacbell.net/earnur/
Back to top
Jerry Coffin
*nix forums Guru


Joined: 21 Feb 2005
Posts: 406

PostPosted: Tue Jul 11, 2006 6:37 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

In article <zBHsg.129441$dW3.67625@newssvr21.news.prodigy.com>,
bogus.address@no.spam says...

[ ... ]

Quote:
Maybe something like THIS might be better:

#include <string
// using namespace std; // Ewww.
template <class Container
void
tokenize
(
const std::string & str,
const std::string & delim,
Container & C
)
{
typedef std::string::size_type Sz;
Sz begin = 0;
Sz end = 0;
while (begin < str.size())
{
begin = str.find_first_not_of (delim, begin);
end = str.find_first_of (delim, begin);
Container.push_back(str.substr(begin, end-begin));
}
}

IMO, this is a poor idea. Take an iterator for the output. If the
user wants the data pushed onto the back, they can use back_inserter
to get that. If they want it inserted into something like a set, they
can use inserter to get that.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Back to top
Robbie Hatley
*nix forums Guru Wannabe


Joined: 12 Jun 2005
Posts: 174

PostPosted: Tue Jul 11, 2006 6:31 am    Post subject: Re: Tokenizer Function (plus rant on strtok documentation) Reply with quote

"jmoy" <jmoy.matecon@gmail.com> wrote:

Quote:
strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.

Ah, sort of like the code my ex-boss left me to maintain after he
got fired. Hundreds of global variables, which he uses to pass
data from function to function, like a dumbass. Of course, since
the program is a complex windows app with timers and interrupts,
the data often gets over-written on its way from one place to
another. ::sigh:: Global variables are the work of Sauron.

Quote:
I guess tying the tokenizer to vector<string> is not a good idea.

It does limit the user to a std::vector<std::string>, yes. However,
that construct is pretty good for this app. I find it hard to
think of cases which couldn't use that to hold a bunch of tokens.

Quote:
If it took an output iterator it could be used with any container
or even with things like ostream_iterators.

Provided that the output container was big enough. If you start
with an empty conainer and try writing to it using output
iterators, you'll get an "illegal memory access" or "general
protection fault" or some such thing. So you'd have to make sure
that the container was huge. I don't like that approach.

Quote:
#include <string
using namespace std;
template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz begin=0;
while(begin<str.size()){
Sz end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
begin=str.find_first_not_of(delim,end);
}
}

I use find_first_not_of in order to be compatible with strtok's
behaviour of treating multiple adjacent delimiters as a single
delimiter. I have not measured the performance of this version against
the strtok version.

Alluring in its simplicity, yes. But has two major bugs:

1. Memory corruption danger if used to write to a small container.
2. You don't take into account the fact that the string might START
with one or more delimiters.

Maybe something like THIS might be better:

#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
(
const std::string & str,
const std::string & delim,
Container & C
)
{
typedef std::string::size_type Sz;
Sz begin = 0;
Sz end = 0;
while (begin < str.size())
{
begin = str.find_first_not_of (delim, begin);
end = str.find_first_of (delim, begin);
Container.push_back(str.substr(begin, end-begin));
}
}

I haven't tested that, but I think something like that would work
better. It does require that the container for the tokens have
the push_back() method defined. Other than that, it's pretty
generic.

Note that to take care of the "starts with delimiters" case,
I simply moved your "first_not_of" up to the top of the loop.
That should work nicely.


--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)
http://home.pacbell.net/earnur/
Back to top
Google

Back to top
Display posts from previous:   
Post new topic   Reply to topic Page 1 of 2 [19 Posts] Goto page:  1, 2 Next
View previous topic :: View next topic
The time now is Thu Nov 27, 2014 9:33 pm | All times are GMT
navigation Forum index » Programming » C++
Jump to:  

Similar Topics
Topic Author Forum Replies Last Post
No new posts Function Pointer Sikandar C 3 Fri Jul 21, 2006 1:23 pm
No new posts Arbitrary function with parameter darknails@gmail.com C++ 2 Fri Jul 21, 2006 9:58 am
No new posts Is there C/C++ corresponding function in Linux for Java's... xiebopublic@gmail.com apps 4 Fri Jul 21, 2006 3:22 am
No new posts Is there C/C++ corresponding function in Linux for Java's... xiebopublic@gmail.com C++ 1 Fri Jul 21, 2006 2:44 am
No new posts Tokenizer Bit byte C++ 2 Fri Jul 21, 2006 2:40 am

Copyright © 2004-2005 DeniX Solutions SRL
Other DeniX Solutions sites: Unix/Linux blog |  electronics forum |  medicine forum |  science forum |  email marketing service
 
Privacy Policy
[ Time: 0.0522s ][ Queries: 16 (0.0065s) ][ GZIP on - Debug on ]