IDL

STREGEX

The STREGEX function performs regular expression matching against the strings contained in StringExpression. STREGEX can either perform a simple boolean True/False evaluation of whether a match occurred, or return the position and length of each match within the strings. The regular expressions accepted by this routine correspond to “POSIX Extended Regular Expressions” and are similar to those used by such UNIX tools as egrep, lex, awk, and Perl.

For more information about regular expressions, see Learning About Regular Expressions.

STREGEX is based on the regex package written by Henry Spencer, modified by Exelis VIS only to the extent required to integrate it into IDL. This package is freely available at: www.arglist.com/regex

Examples


See Additional Examples for more information on using STREGEX.

Example 1

To match a string starting with an “a”, followed by a “b”, followed by 1 or more “c”:

pos = STREGEX('aaabccc', 'abc+', length=len)
PRINT, STRMID('aaabccc', pos, len)

IDL Prints:

abccc

To perform the same match, and also find the locations of the three parts:

pos = STREGEX('aaabccc', '(a)(b)(c+)', length=len, /SUBEXPR)
PRINT, STRMID('aaabccc', pos, len)

IDL Prints:

abccc a b ccc

Or more simply:

PRINT, STREGEX('aaabccc', '(a)(b)(c+)', /SUBEXPR, /EXTRACT)

IDL Prints:

abccc a b ccc

Syntax


Result = STREGEX( StringExpression, RegularExpression [, /BOOLEAN | , /EXTRACT | , LENGTH=variable [, /SUBEXPR]] [, /FOLD_CASE] )

Return Value


By default, STREGEX returns the position of the matched string within each element of StringExpression. If no match is found, -1 is returned. Optionally, STREGEX can return a boolean True/False result of the match or the matched strings.
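
As a minimal sketch of the default return value (the sample array is hypothetical, chosen only for illustration):

```idl
; Hypothetical sample data, not taken from the IDL documentation
words = ['alpha', 'beta', 'gamma']

; Default call: position of the first match in each element, or -1
pos = STREGEX(words, 'et')
PRINT, pos   ; only 'beta' contains 'et' (at offset 1), so pos is [-1, 1, -1]
```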

Arguments


StringExpression

A string or string array in which to search for matches of RegularExpression.

RegularExpression

A scalar string containing the regular expression to match. See Learning About Regular Expressions for a description of the meta characters that can be used in a regular expression.

Keywords


BOOLEAN

Normally, STREGEX returns the position of the first character in each element of StringExpression that matches RegularExpression. Setting BOOLEAN modifies this behavior to return a True/False value indicating whether a match occurred. The BOOLEAN keyword cannot be used with EXTRACT, LENGTH, or SUBEXPR.
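
One common use of BOOLEAN is filtering an array. The sketch below assumes a hypothetical list of file names; the names and the '\.dat$' pattern are illustrative only:

```idl
; Hypothetical file names used only for illustration
files = ['run01.dat', 'notes.txt', 'run02.dat']

; 1 where the name ends in ".dat", 0 elsewhere
isData = STREGEX(files, '\.dat$', /BOOLEAN)

; Keep only the matching names; guard against the no-match case
idx = WHERE(isData, count)
IF count GT 0 THEN PRINT, files[idx]   ; run01.dat run02.dat
```

IDL string literals do not treat the backslash as an escape character, so '\.' reaches the regular expression engine unchanged and matches a literal period.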

EXTRACT

Normally, STREGEX returns the position of the first character in each element of StringExpression that matches RegularExpression. Setting EXTRACT modifies this behavior to return the matched substrings instead. Elements in which no match is found yield a null (empty) string. The EXTRACT keyword cannot be used with either BOOLEAN or LENGTH.
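
A short sketch of EXTRACT on an array, assuming hypothetical input lines; elements without a match come back as empty strings:

```idl
; Hypothetical input lines used only for illustration
lines = ['temp=21.5', 'no reading', 'temp=19']

; Pull out the first integer or decimal number in each element
vals = STREGEX(lines, '[0-9]+(\.[0-9]+)?', /EXTRACT)

; vals is ['21.5', '', '19']; the second element has no digits
FOR i = 0, N_ELEMENTS(vals) - 1 DO PRINT, i, ': "', vals[i], '"'
```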

FOLD_CASE

Regular expression matching is normally a case-sensitive operation. Set FOLD_CASE to perform case-insensitive matching instead.
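
For instance, the following sketch (with an illustrative input string) contrasts the two modes:

```idl
; Case-sensitive by default: lowercase 'idl' is not found
PRINT, STREGEX('IDL Reference', 'idl')               ; -1

; With FOLD_CASE the same pattern matches at offset 0
PRINT, STREGEX('IDL Reference', 'idl', /FOLD_CASE)   ; 0
```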

LENGTH

Set this keyword equal to a named variable that will contain the length of each matching string found. If no match is found in an element of StringExpression, the returned variable will contain -1 for that element. Together with the result of this function, which contains the starting points of the matches in StringExpression, LENGTH can be used with the STRMID function to extract the matched substrings. The LENGTH keyword cannot be used with either BOOLEAN or EXTRACT.
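
The sketch below pairs the returned positions with LENGTH on a hypothetical string array. It loops over the elements and calls STRMID with scalar arguments, which sidesteps any question of how STRMID pairs array arguments, and it skips elements that did not match:

```idl
; Hypothetical input used only for illustration
lines = ['id: 42', 'id: 7', 'none']

pos = STREGEX(lines, '[0-9]+', LENGTH=len)
; pos is [4, 4, -1] and len is [2, 1, -1]

; Extract each matched substring, ignoring elements with no match
FOR i = 0, N_ELEMENTS(lines) - 1 DO $
  IF pos[i] GE 0 THEN PRINT, STRMID(lines[i], pos[i], len[i])
```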

SUBEXPR

By default, STREGEX only reports the overall match. Setting SUBEXPR causes it to report the overall match as well as any subexpression matches. A subexpression is any part of a regular expression written within parentheses. For example, the regular expression '(a)(b)(c+)' has 3 subexpressions, whereas the functionally equivalent 'abc+' has none. The SUBEXPR keyword cannot be used with BOOLEAN.

If a subexpression participated in the match several times, the reported substring is the last one it matched. In particular, when the regular expression '(b*)+' matches 'bbb', the parenthesized subexpression first matches the three 'b's and can then match the empty string after the last 'b' any number of times, so the reported substring is empty. This occurs because '*' matches zero or more instances of the item that precedes it.
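
A simpler illustration of the "last match wins" rule, using a hypothetical pattern whose subexpression cannot match the empty string:

```idl
; The group (ab|cd) participates twice: first as 'ab', then as 'cd'
res = STREGEX('abcd', '(ab|cd)+', /SUBEXPR, /EXTRACT)

; res[0] is the overall match, res[1] the last substring the group matched
PRINT, res   ; abcd cd
```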

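In order to return multiple positions and lengths for each input, the result returned when SUBEXPR is set (and the LENGTH variable, if supplied) has a new first dimension added compared to StringExpression. The size of this dimension is the number of subexpressions plus one: element 0 describes the overall match, and the remaining elements describe the subexpressions in order.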
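
The added dimension can be inspected with a sketch like the following, which assumes a hypothetical array of date strings and a pattern with three subexpressions:

```idl
; Hypothetical date strings used only for illustration
dates = ['2012-09-14', '2014-01-02']
re = '([0-9]+)-([0-9]+)-([0-9]+)'

; Two input strings, overall match plus 3 subexpressions:
; pos and len are 4 x 2 arrays
pos = STREGEX(dates, re, LENGTH=len, /SUBEXPR)
HELP, pos, len

; With EXTRACT the result is a 4 x 2 string array
parts = STREGEX(dates, re, /SUBEXPR, /EXTRACT)
PRINT, parts[0, *]   ; overall matches: the full date strings
PRINT, parts[1, *]   ; first subexpression: 2012 and 2014
```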

Additional Examples


Example 2

This example searches a string array for words of any length beginning with “f” and ending with “t” without the letter “o” in between:

str = ['foot', 'Feet', 'fate', 'FAST', 'ferret', 'affluent']
PRINT, STREGEX(str, '^f[^o]*t$', /EXTRACT, /FOLD_CASE)

This statement results in:

Feet FAST ferret

Note the following about this example:

  • Unlike the * wildcard character used by STRMATCH, the * meta character used by STREGEX applies to the item directly on its left, which in this case is [^o], meaning “any character except the letter ‘o’ ”. Therefore, [^o]* means “zero or more characters that are not ‘o’ ”, whereas the following statement would find only words whose second character is not “o”:
  PRINT, str[WHERE(STRMATCH(str, 'f[!o]*t', /FOLD_CASE) EQ 1)]
  • The anchors (^ and $) tell STREGEX to find only words that begin with “f” and end with “t”. If we left out the ^ anchor in the above example, STREGEX would also return “ffluent” (a substring of “affluent”). Similarly, if we left out the $ anchor, STREGEX would also return “fat” (a substring of “fate”).
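
The effect of each anchor described above can be checked with the same array by dropping one anchor at a time (a sketch; non-matching elements are returned as empty strings):

```idl
str = ['foot', 'Feet', 'fate', 'FAST', 'ferret', 'affluent']

; Without ^ : the match may start mid-word, so 'affluent' yields 'ffluent'
PRINT, STREGEX(str, 'f[^o]*t$', /EXTRACT, /FOLD_CASE)

; Without $ : the match may end mid-word, so 'fate' yields 'fat'
PRINT, STREGEX(str, '^f[^o]*t', /EXTRACT, /FOLD_CASE)
```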

Version History


5.3

Introduced

See Also


String Operations, String Processing, STRCMP, STRJOIN, STRMATCH, STRMID, STRPOS, STRSPLIT



Notes


Eduardo Iturrate
Several examples with Regular Expressions:

; Define string variable (named html to avoid shadowing the STRING function)
html = '<H1>Lorem ipsum</H1> dolor sit <h1>amet</h1>, consectetur <a href="http://www.idldatapoint.com">adipisicing</a> elit, sed do <a href="mailto:me@myserver.com" target="_blank" rel="nofollow">me@myserver.com</a>eiusmod...'

; Extract email address
print, stregex(html, "[-_.[:alnum:]]+@[-_.[:alnum:]]+\.[[:alnum:]]+", /EXTRACT)
; IDL Returns: me@myserver.com

; Extract URL link (a doubled "" inside a double-quoted IDL string is a literal ")
print, stregex(html, "href=[""'][-_.:=/ [:alnum:]]+[""']", /EXTRACT)
; IDL Returns: href="http://www.idldatapoint.com"

; First HTML tag
print, stregex(html, "<[^>]*>", /EXTRACT)
; IDL Returns: <H1>

; Pair of HTML tags and text in between
print, stregex(html, "<[^>]+>[^<]*</[^>]+>", /EXTRACT)
; IDL Returns: <H1>Lorem ipsum</H1>
14th September 2012 1:11pm

© 2014 Exelis Visual Information Solutions