N S

Novick Software Management • Design • Programming • Training • Consulting

   

 

 

SQL Server T-SQL User-Defined Function of the Week

CHARINDEX() from the Back of the String

Volume 1 Number 10   January 21, 2003
This article has been revised on June 16th, 2003.
If you copied the function udf_Txt_CharIndexRev from
an earlier issue, or if you got it in the newsletter,
please update you code with this newer version. I 
apologize for any inconvenience that I may have caused.

Sign up for this newsletter at:
http://www.novicksoftware.com/UdfOfWeek/UDFofWeekSignup.htm

Before we get to the UDF of the week, I'd like to alert you to 
two new UDF resources on the web.  The first is a presentation 
that I did for the New England SQL Server User Group in Waltham, 
MA.  You'll find the slides and the data at: 
http://www.NovickSoftware.com/Presentations.htm

The second resource is a new web article that I wrote about using
Inline UDFs for managing web page data retrieval.  You'll find
it at: http://www.15seconds.com/issue/030113.htm

I like 15seconds. They're one of the web sites that creates
new content rather than just republishing articles that
were created elsewhere.

This week's UDF is a useful text handling function
udf_Txt_CharIndexRev.  It works just like the built in CHARINDEX
function except that it works from the back of the string.  
Here's the CREATE FUNCTION script:

/------------- Copy From Below this line ----------------------\
CREATE FUNCTION dbo.udf_Txt_CharIndexRev (

    @SearchFor varchar(255) -- Sequence to be found
  , @SearchIn varchar(8000) -- The string to be searched
)  RETURNS int -- Position from the back of the string where
               -- @SearchFor is found in @SearchIn
    WITH SCHEMABINDING
/*
* Searches for a string in another string working from the back.
* It reports the position (relative to the front) of the first 
* such expression it finds. If the expression is not found, it 
* returns zero.
*
* Equivalent Template: 
       CASE 
         WHEN CHARINDEX(
                      , ) > 0
         THEN LEN() 
              - CHARINDEX(REVERSE()
                  , REVERSE ()) + 1
         ELSE 0 END  
* Example:
select dbo.udf_Txt_CharIndexRev('\', 'C:\temp\abcd.txt')
*
* Test:
PRINT 'Test 1    ' + CASE WHEN 8=
           dbo.udf_Txt_CharIndexRev ('\', 'C:\temp\abcd.txt')
                     THEN 'Worked' ELSE 'ERROR' END
PRINT 'Test 2    ' + CASE WHEN 0=
           dbo.udf_Txt_CharIndexRev ('*', 'C:\tmp\d.txt')
                     THEN 'Worked' ELSE 'ERROR' END
--                                     12345678901234567890
PRINT 'Test 3    ' + CASE WHEN 18=
       dbo.udf_Txt_CharIndexRev('fj', 'f123 asdasfdfdfddfjas  ')
                     THEN 'Worked' ELSE 'ERROR' END
*
* History
* WHEN       WHO        WHAT
* ---------- ---------- -----------------------------------------
* 2002-12-10 ASN        Initial Coding
* 2003-06-16 ASN        Fixed problem with Test 3 by using 
                        DATALENGTH instead of LEN
****************************************************************/
AS BEGIN

    DECLARE @Result int
          , @StringLen int
          , @ReverseIn varchar(8000)
          , @ReverseFor varchar(255) 

    SELECT @ReverseIn = REVERSE (@SearchIn)
         , @ReverseFor = REVERSE(@SearchFor)
         , @StringLen = DATALENGTH(@SearchIn)

    SELECT @Result = CHARINDEX(@ReverseFor, @ReverseIn)
    
    -- return the position from the front of the string                          
    IF @Result > 0 
        SET @Result = @StringLen - (@Result 
                                  + DATALENGTH(@SearchFor) 
                                  - 2
                                   )
    -- ENDIF

    RETURN @Result
END
GO
GRANT EXEC on dbo.udf_Txt_CharIndexRev to PUBLIC
GO
\------------ Stop copying above this line --------------------/

Let's try out a few simple cases with this test query:
/------------- Copy From Below this line ----------------------\
select dbo.udf_Txt_CharIndexRev('fdf', 'f123 asdasfdfdfddfjas  ')
            as [Middle]
     , dbo.udf_Txt_CharIndexRev('C', 'C:\temp\ab.txt') as [start]
     , dbo.udf_Txt_CharIndexRev('X', '123456789X') as [end]
     , dbo.udf_Txt_CharIndexRev('AB', '12347') as [missing]
GO
\------------ Stop copying above this line --------------------/
(Results)
Middle      start       end         missing     
----------- ----------- ----------- ----------- 
         13           1          10           0 
(End of results)

As you can see from the first expression udf_Txt_CharIndexRev
gives the position of the string that we're looking for from the
beginning of string that's being searched.

This gives us a useful function.  I've used it most often 
for tasks like parsing the file name from of a complete path. I'm
sure you'll find other uses for it.

udf_Txt_CharIndexRev is simple enough that it's possible to
replace it with an expression.  This gives us an opportunity
to explore one of the sections of the documentation block 
and to do some experimentation into the performance of UDFs.

The section in the documentation block labeled "Equivalent 
Template" tells how to achieve the same result as the UDF without
using the UDF.  Yea, this UDF tells you how to get by without
using it.  It's done in the name of efficiency. 

However useful a user-defined function is, from a performance
point of view, using the UDFs is often the equivalent of using
cursors.  They're great when you need them. Sometimes they're 
essential. But they can be a performance headache.

My philosophy about writing efficient code is: 

    Writing code is an economic activity with tradeoffs between
    the cost of writing code and the cost of running it.  The
    best course of action is to write good, easy to maintain 
    code and use a basic concern for performance to eliminate 
    any obvious performance problem. But don't try and optimize
    everything.  Then during testing and as needed, refine any
    code that has become a performance issue.  While this 
    strategy permits some less than optimal code into a system,
    it helps your system get finished.  
   
Without such an approach, very little gets completed.  And my 
job is always to deliver code that can be used in production.
If a project dangles near completion forever I haven't 
accomplished my goal.

Using UDFs is great! Most of the time.  But there are times when
the extra overhead of the UDF execution method become too slow. 
Eliminating the UDF, where possible, is a strategy that usually
produces dramatically faster query execution times.  By 
"dramatically" I mean 2 times to 500 times faster.  Enough to make
a real difference in the response time of your application.

The "Equivalent Template" section of the documentation block is used
to replace a function call to udf_Txt_CharIndexRev with the text
in the template and then using SQL Query Analyzer's menu item 
Edit\Replace Template Parameters... to put the function 
parameters into the revised SQL at the locations where they 
belong.  I'm using the Query Analyzer template facility because
the parameters have to be inserted in several places.

To test this out we're going to construct a table, 
ExampleNumberString, with a large number of rows.  Here's the 
table structure:

   CREATE TABLE ExampleNumberString (
                ID int identity (1,1)
              , BigNum Numeric (38, 0)
              , NumberString varchar(128) NULL
                )

Don't bother creating in on your system yet. There is a 
stored procedure to create and populate it a few paragraphs
below. 

Let's construct a query that uses udf_Txt_CharIndexRev to
search for the string '83' from the back of the NumberString
column and then replace the UDF with the equivalent expression.
First, here's the query:

SELECT dbo.udf_Txt_CharIndexRev ('83', NumberString)
                 as [Index of 83 From the string end]
    FROM ExampleNumberString

Next, take a look at the "Equivalent Template":

       CASE 
         WHEN CHARINDEX(
                      , ) > 0
         THEN LEN() 
              - CHARINDEX(REVERSE()
                  , REVERSE ()) + 1
         ELSE 0 END

After using the Query Analyzer's menu item 
Edit\Replace Template Parameters...  The template parameters
are replaced with the expressions supplied to the "Request
Template Parameters" dialog box.

Figure 1 has the "Replace Template Parameters" dialog box.  
                          Figure 1
After replacing the template parameters we have this revised 
query:

SELECT CASE WHEN CHARINDEX('83', NumberString) > 0
         THEN LEN(NumberString) 
              - CHARINDEX(REVERSE('83')
                  , REVERSE (NumberString)) + 1
         ELSE 0 END as [Index of 83 From the string end]
    FROM ExampleNumberString

The results of the two queries are the same.  But in the 
second one the UDF has been eliminated.

Is the query without the UDF really faster?  It might be.  
I think it is. But how do we know?

Experimentation will tell.  And to conduct a proper experiment
we're going to need more then the 100 row table that you'll find
in the Northwind or pubs database.  The next stored procedure, 
usp_CreateExampleNumberString, creates the ExampleNumberString 
table and populates it.  The NumberString column is a 20 to 23
digit numeric string that we can use for searching.  Here's the
CREATE PROC script:

/------------- Copy From Below this line ----------------------\
CREATE PROC usp_CreateExampleNumberString 

    @Loops int = 20 -- creates POWER (2, @Loops) Rows 
                    -- 20 Creates 1,048,576 rows
AS

    DECLARE @LC int -- Loop counter

    -- delete an existing ExampleNumberString table
    if exists (select * from dbo.sysobjects 
                where id = object_id(N'dbo.ExampleNumberString')
                     and OBJECTPROPERTY(id, N'IsUserTable') = 1)
        DROP TABLE dbo.ExampleNumberString

   CREATE TABLE ExampleNumberString (
                ID int identity (1,1)
              , BigNum Numeric (38, 0)
              , NumberString varchar(128) NULL
                )

    INSERT INTO ExampleNumberString  (BigNum, NumberString)
          VALUES( CONVERT (numeric(38,0), rand() * 9999999999999)
                , '                            ') -- preallocate


    SELECT @LC = 0
    WHILE @LC < @Loops BEGIN

        INSERT INTO ExampleNumberString (BigNum, NumberString)
            SELECT BigNum * RAND(@LC + 1) * 2  
                  , '                             '
                FROM ExampleNumberString

        SELECT @LC = @LC + 1
    END -- WHILE

   UPDATE ExampleNumberString
        SET NumberString = convert(varchar(128)
                     , convert(numeric(38,0), 9834311) * bignum)
\------------ Stop copying above this line --------------------/

Although I pulled usp_CreateNumberString together pretty quickly
it does a reasonable job of producing a different large
NumberString for each row.  I found about one percent duplication
of numbers, which I find acceptable for most testing scenarios.
Creating a meg of rows took about 40 seconds on my desktop 
testing machine. 

Once you've created the procedure, run it to create as many
rows as you like.  The parameter to usp_CreateExampleNumberString
is use as the exponent of 2 to calculate the number of rows 
desired.  So 3 creates 8 rows and 20 creates 1,048,576 rows.
I suggest that you use at least 8192 rows (@Loops = 13) to get
meaningful results.

The next script uses usp_CreateExampleNumberString to create and
populate the ExampleNumberString table. Twenty is the default 
and that creates the million+ rows that we want:
/------------- Copy From Below this line ----------------------\
exec usp_CreateExampleNumberString default -- for a million+ rows
select count(*) from ExampleNumberString -- verify the count
select top 10 * from ExampleNumberString -- check out a few rows
go
\------------ Stop copying above this line --------------------/

Since we're going to run two queries against the NumberString
column to test the time it takes to use a UDF on the million rows,
we should try and make the circumstances for the two queries 
as close as possible.  One way to do that is to force all the
rows that we'll query into memory and keep them there.  

If we don't have all the desired rows in memory then we might 
be comparing one query that reads a million rows from disk with 
another query that reads the same million rows from SQL Server's 
page cache.  

SQL Server provides the DBCC PINTABLE to force all pages from 
a table to remain in the cache once they've been read the first
time.  Use it with caution.  It can fill SQL Server's cache and 
cause the query engine to lock up to the point where you have to
shut it down and restart it.  So only do this if you have
adequate RAM.  There's no point using virtual memory as a
substitute RAM.  That just substitutes one form of disk I/O 
(paging) for the one we're trying to eliminate (reading pages 
from disk). 

This next script pins ExampleNumberString in memory.  On my 
test system with a million row table it forced SQL Server to 
consume 82 megabytes of RAM.  The test system has 512 megabytes
of RAM and I can use the task manager to see that memory isn't
full.  If your system has less available RAM, reduce the number
of rows in the test to eliminate paging.

/------------- Copy From Below this line ----------------------\
-- Pin ExampleNumberString into memory
-- Be sure that you have enough memory available before you do 
-- this.  My SQL Server Process grew to 82 Megabytes when I ran
-- this script.
DECLARE @db_id int, @tbl_id int
SET @db_id = DB_ID('TSQLWorking')
SET @tbl_id = OBJECT_ID('ExampleNumberString')
DBCC PINTABLE (@db_id, @tbl_id)
GO
-- Now read all the rows to force the pages into the cache
SELECT * from ExampleNumberString
GO
\------------ Stop copying above this line --------------------/

The stage is set for comparing the two queries that were
created above and a new query, Query #0, which just reads the 
data and does a very simple calculation (the LEN function) on it.

Before we run the queries, I want to throw in one more wrinkle.
Any query that returns a million rows to SQL Query Analyzer's 
results window is going do a lot of work on sending, receiving,
and displaying the results.  To eliminate most of that work, 
I've used the MAX function to just return biggest result.  Our 
three queries with their results are:

Query #0:
/------------- Copy From Below this line ----------------------\
SET STATISTICS TIME ON
SELECT MAX( LEN(NumberString )) as [Max Len]
    FROM ExampleNumberString
GO
\------------ Stop copying above this line --------------------/
(Results)
SQL Server parse and compile time: 
   CPU time = 2 ms, elapsed time = 2 ms.Max Length  
----------- 
         23 

SQL Server Execution Times:
   CPU time = 991 ms,  elapsed time = 991 ms.
(End of Results for Query #0)   
    
Query #1:
/------------- Copy From Below this line ----------------------\
SET STATISTICS TIME ON
SELECT MAX (dbo.udf_Txt_CharIndexRev ('83', NumberString))
             as [Right Most Position]
    FROM ExampleNumberString
GO
\------------ Stop copying above this line --------------------/
(Results)
SQL Server parse and compile time: 
   CPU time = 2 ms, elapsed time = 2 ms.
Right Most Position 
------------------- 
                 23 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
(End of Results - This actually took 4 minutes 38 seconds)
(               - CPU was working at 100% during that time)


Query #2:
/------------- Copy From Below this line ----------------------\
SET STATISTICS TIME ON
SELECT MAX (CASE WHEN CHARINDEX('83', NumberString) > 0
         THEN LEN(NumberString) - CHARINDEX(REVERSE('83')
                  , REVERSE (NumberString)) + 1
         ELSE 0 END  
        ) [Right Most Position]
    FROM ExampleNumberString
GO
\------------ Stop copying above this line --------------------/
(Results)
SQL Server parse and compile time: 
   CPU time = 3 ms, elapsed time = 3 ms.
Right Most Position 
------------------- 
                 23 

SQL Server Execution Times:
   CPU time = 2934 ms,  elapsed time = 2954 ms.
(End of Results)

Query #1, the UDF, didn't report it's execution times correctly.
It seems that "SET STATISTICS TIME" is limited in the duration
that it can measure.  I used the task manager to watch what was 
going on and the CPU time for Query #1 is very close to the 
elapsed time.  For our comparison we'll have to use the elapsed
time.

Here's the comparison:
Query  Description                Time  (milliseconds)  Net
----- --------------------------- -------------------- ---------
  #0  simple expression                        991         N/A
  #1  UDF                                   278000      277009
  #2  UDF replaced by expression              2954        1963

The Net column subtracts the time it took to run the very
simplest expression on the same set of strings that are used in
Queries #1 and #2.  I think that the Net column has the numbers 
that should be compared because it isolates just the effect of 
running the UDF or the replacement expression.

The difference in time is dramatic.  The UDF takes about 140 
times as much time to run as an equivalent expression. Wow!
That's the kind of difference the users would really notice.

Of course, the difference isn't going to be perceivable when 
the query is run on 1, 5 or even 100 rows.  When the number
of rows grows into the thousands the difference begins to
be noticeable.

The lesson that I draw from this experiment is that UDFs must
used with care in performance sensitive situations.  They're 
a great tool for simplifying code and in promoting code reuse. 
But they can have a dramatic negative effect on performance.


+--------------------------------------------------------------+
Do you have a great UDF that you'd like to share?  Or maybe you
have a T-SQL problem that you think could be solved by a UDF
but you don't know how? Send them to:

UDF@NovickSoftware.com 

and they might be published in this newsletter.  I try and 
respond to every request that I get.  

Thanks,  
Andrew Novick
+--------------------------------------------------------------+


RSS as HTML

Personal Blog

 
New Tips:

Loading SQL Profiler trace (.trc) files with fn_trace_gettrace

Upcoming
Presentations:

Big Data:
Working with Terabytes
in SQL Server

July 9 '08


Full Schedule