Returning matched subexpressions

The REFind and REFindNoCase functions return the location in the search string of the first match of the regular expression. Even though the search string in the next example contains two matches of the regular expression, the function only returns the index of the first:

<cfset IndexOfOccurrence=REFind(" BIG ", "Some BIG BIG string")>
<!--- The value of IndexOfOccurrence is 5 --->

To find all instances of the regular expression, you must call REFind or REFindNoCase multiple times, advancing the starting index on each call.

Both the REFind and REFindNoCase functions take an optional third parameter that specifies the starting index in the search string for the search. By default, the starting location is index 1, the beginning of the string.

To find the second instance of the regular expression in this example, you call REFind with a starting index of 8:

<cfset IndexOfOccurrence=REFind(" BIG ", "Some BIG BIG string", 8)>
<!--- The value of IndexOfOccurrence is 9 --->

In this case, the function returns an index of 9, the starting index of the second string " BIG ".

To find the second occurrence of the string, you must know that the first string occurred at index 5 and that the string's length was 5. However, REFind returns only the starting index of the string, not its length. So, you must either know the length of the matched string before you call REFind the second time, or have REFind return the position and length of the match, as described next.

The REFind and REFindNoCase functions let you get information about matched subexpressions. If you set these functions' fourth parameter, ReturnSubExpression, to True, the functions return a CFML structure with two arrays, pos and len, containing the positions and lengths of text strings that match the subexpressions of a regular expression, as the following example shows:

<cfset sLenPos=REFind(" BIG ", "Some BIG BIG string", 1, "True")>
<cfoutput>
   <cfdump var="#sLenPos#">
</cfoutput><br>

The following figure shows the output of the cfdump tag:


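Output of the cfdump tag: a structure whose len array contains the single element 5 and whose pos array contains the single element 5.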

Element one of the pos array contains the starting index in the search string of the string that matched the regular expression. Element one of the len array contains the length of the matched string. For this example, the index of the first " BIG " string is 5 and its length is also 5. If there are no occurrences of the regular expression, the pos and len arrays each contain one element with a value of 0.
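Because a failed search still returns a structure, it can help to test the first element of pos before using the result. The following is a minimal sketch of that check; the pattern " SMALL " and the output messages are illustrative assumptions, chosen only so that the search finds nothing:

```cfml
<cfset sLenPos = REFind(" SMALL ", "Some BIG BIG string", 1, "True")>
<!--- pos[1] is 0 when the regular expression does not occur in the string --->
<cfif sLenPos.pos[1] EQ 0>
   <cfoutput>No match found.</cfoutput>
<cfelse>
   <cfoutput>Match at index #sLenPos.pos[1]#, length #sLenPos.len[1]#.</cfoutput>
</cfif>
```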

You can use the returned information with other string functions, such as Mid. The following example returns the part of the search string that matches the regular expression:

<cfset myString="Some BIG BIG string">
<cfset sLenPos=REFind(" BIG ", myString, 1, "True")>
<cfoutput>
   #mid(myString, sLenPos.pos[1], sLenPos.len[1])#
</cfoutput>

Each additional element in the pos array contains the position of the first match of each subexpression in the search string. Each additional element in len contains the length of the subexpression's match.

In the previous example, the regular expression " BIG " contained no subexpressions. Therefore, each array in the structure returned by REFind contains a single element.

After executing the previous example, you can call REFind a second time to find the second occurrence of the regular expression. This time, you use the information returned by the first call to make the second:

<cfset newstart = sLenPos.pos[1] + sLenPos.len[1] - 1>
<!--- subtract 1 because you need to start at the first space --->
<cfset sLenPos2=REFind(" BIG ", "Some BIG BIG string", newstart, "True")>
<cfoutput>
   <cfdump var="#sLenPos2#">
</cfoutput><br>

The following figure shows the output of the cfdump tag:


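Output of the cfdump tag: a structure whose len array contains the single element 5 and whose pos array contains the single element 9.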
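The same technique extends to a loop that collects every occurrence. The following is a sketch rather than a definitive implementation; the variable names start, positions, and m are assumptions introduced for illustration. As in the two-call example, it resumes the search at the last character of each match, because consecutive " BIG " matches share a space:

```cfml
<cfset myString = "Some BIG BIG string">
<cfset start = 1>
<cfset positions = ArrayNew(1)>
<cfloop condition="start LTE Len(myString)">
   <cfset m = REFind(" BIG ", myString, start, "True")>
   <!--- pos[1] is 0 when no further match exists --->
   <cfif m.pos[1] EQ 0>
      <cfbreak>
   </cfif>
   <cfset temp = ArrayAppend(positions, m.pos[1])>
   <!--- Resume at the last character of this match. Max() guarantees
         the start index always advances, even for very short matches. --->
   <cfset start = m.pos[1] + Max(m.len[1] - 1, 1)>
</cfloop>
<!--- For this string, the list contains 5 and 9 --->
<cfoutput>#ArrayToList(positions)#</cfoutput>
```

If matches in your data cannot overlap, you can resume at m.pos[1] + m.len[1] instead.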

If you include subexpressions in your regular expression, each element of pos and len after element one contains the position and length of the first occurrence of each subexpression in the search string.

In the following example, the expression [A-Za-z]+ is a subexpression of a regular expression, and the back reference \1 matches a repetition of whatever text the subexpression matched. The first match for the expression ([A-Za-z]+)[ ]+\1 is "is is".

<cfset sLenPos=REFind("([A-Za-z]+)[ ]+\1",
"There is is a cat in in the kitchen", 1, "True")>
<cfoutput> <cfdump var="#sLenPos#"> </cfoutput><br>

The following figure shows the output of the cfdump tag:


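Output of the cfdump tag: a structure whose len array contains the elements 5 and 2 and whose pos array contains the elements 7 and 7.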

The entries sLenPos.pos[1] and sLenPos.len[1] contain information about the match of the entire regular expression. The array elements sLenPos.pos[2] and sLenPos.len[2] contain information about the first subexpression ("is"). Because REFind returns information on the first regular expression match only, the sLenPos structure does not contain information about the second match of the regular expression, "in in".
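To reach the second doubled word, you can call REFind again, starting just past the first match. The following is a minimal sketch; the variable names sString, newstart, and sLenPos2 are assumptions introduced for illustration:

```cfml
<cfset sString = "There is is a cat in in the kitchen">
<cfset sLenPos = REFind("([A-Za-z]+)[ ]+\1", sString, 1, "True")>
<!--- Start the second search at the first character after the first match --->
<cfset newstart = sLenPos.pos[1] + sLenPos.len[1]>
<cfset sLenPos2 = REFind("([A-Za-z]+)[ ]+\1", sString, newstart, "True")>
<cfif sLenPos2.pos[1] GT 0>
   <!--- Displays the text of the second match, "in in" --->
   <cfoutput>#Mid(sString, sLenPos2.pos[1], sLenPos2.len[1])#</cfoutput>
</cfif>
```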

The regular expression in the following example uses two subexpressions. Therefore, each array in the output structure contains the position and length of the first match of the entire regular expression, the first match of the first subexpression, and the first match of the second subexpression.

<cfset sString = "apples and pears, apples and pears, apples and pears">
<cfset regex = "(apples) and (pears)">
<cfset sLenPos = REFind(regex, sString, 1, "True")>
<cfoutput>
   <cfdump var="#sLenPos#">
</cfoutput><br><br>

The following figure shows the output of the cfdump tag:


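Output of the cfdump tag: a structure whose len array contains the elements 16, 6, and 5 and whose pos array contains the elements 1, 1, and 12.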
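To display the matched text rather than its positions, you can pass each pos and len pair to Mid. The following is a sketch; the loop variable i is an assumption introduced for illustration, and the test for a position greater than 0 is a precaution so that Mid is never called with an invalid start index:

```cfml
<cfset sString = "apples and pears, apples and pears, apples and pears">
<cfset sLenPos = REFind("(apples) and (pears)", sString, 1, "True")>
<cfoutput>
<cfloop index="i" from="1" to="#ArrayLen(sLenPos.pos)#">
   <!--- Element 1 is the whole match; later elements are the subexpressions --->
   <cfif sLenPos.pos[i] GT 0>
      Element #i#: #Mid(sString, sLenPos.pos[i], sLenPos.len[i])#<br>
   </cfif>
</cfloop>
</cfoutput>
```

For this string, the loop displays "apples and pears", "apples", and "pears".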

For a full discussion of subexpression usage, see the sections on REFind and REFindNoCase in the ColdFusion functions chapter in CFML Reference.
