|
|
SQL Server T-SQL User-Defined Function of the Week
CHARINDEX() from the Back of the String
Volume 1 Number 10 January 21, 2003
This article has been revised on June 16th, 2003.
If you copied the function udf_Txt_CharIndexRev from
an earlier issue, or if you got it in the newsletter,
please update you code with this newer version. I
apologize for any inconvenience that I may have caused.
Sign up for this newsletter at:
http://www.novicksoftware.com/UdfOfWeek/UDFofWeekSignup.htm
Before we get to the UDF of the week, I'd like to alert you to
two new UDF resources on the web. The first is a presentation
that I did for the New England SQL Server User Group in Waltham,
MA. You'll find the slides and the data at:
http://www.NovickSoftware.com/Presentations.htm
The second resource is a new web article that I wrote about using
Inline UDFs for managing web page data retrieval. You'll find
it at: http://www.15seconds.com/issue/030113.htm
I like 15seconds. They're one of the web sites that creates
new content rather than just republishing articles that
were created elsewhere.
This week's UDF is a useful text handling function
udf_Txt_CharIndexRev. It works just like the built in CHARINDEX
function except that it works from the back of the string.
Here's the CREATE FUNCTION script:
/------------- Copy From Below this line ----------------------\
CREATE FUNCTION dbo.udf_Txt_CharIndexRev (
@SearchFor varchar(255) -- Sequence to be found
, @SearchIn varchar(8000) -- The string to be searched
) RETURNS int -- Position from the back of the string where
-- @SearchFor is found in @SearchIn
WITH SCHEMABINDING
/*
* Searches for a string in another string working from the back.
* It reports the position (relative to the front) of the first
* such expression it finds. If the expression is not found, it
* returns zero.
*
* Equivalent Template:
CASE
WHEN CHARINDEX(
, ) > 0
THEN LEN()
- CHARINDEX(REVERSE()
, REVERSE ()) + 1
ELSE 0 END
* Example:
select dbo.udf_Txt_CharIndexRev('\', 'C:\temp\abcd.txt')
*
* Test:
PRINT 'Test 1 ' + CASE WHEN 8=
dbo.udf_Txt_CharIndexRev ('\', 'C:\temp\abcd.txt')
THEN 'Worked' ELSE 'ERROR' END
PRINT 'Test 2 ' + CASE WHEN 0=
dbo.udf_Txt_CharIndexRev ('*', 'C:\tmp\d.txt')
THEN 'Worked' ELSE 'ERROR' END
-- 12345678901234567890
PRINT 'Test 3 ' + CASE WHEN 18=
dbo.udf_Txt_CharIndexRev('fj', 'f123 asdasfdfdfddfjas ')
THEN 'Worked' ELSE 'ERROR' END
*
* History
* WHEN WHO WHAT
* ---------- ---------- -----------------------------------------
* 2002-12-10 ASN Initial Coding
* 2003-06-16 ASN Fixed problem with Test 3 by using
DATALENGTH instead of LEN
****************************************************************/
AS BEGIN
DECLARE @Result int
, @StringLen int
, @ReverseIn varchar(8000)
, @ReverseFor varchar(255)
SELECT @ReverseIn = REVERSE (@SearchIn)
, @ReverseFor = REVERSE(@SearchFor)
, @StringLen = DATALENGTH(@SearchIn)
SELECT @Result = CHARINDEX(@ReverseFor, @ReverseIn)
-- return the position from the front of the string
IF @Result > 0
SET @Result = @StringLen - (@Result
+ DATALENGTH(@SearchFor)
- 2
)
-- ENDIF
RETURN @Result
END
GO
GRANT EXEC on dbo.udf_Txt_CharIndexRev to PUBLIC
GO
\------------ Stop copying above this line --------------------/
Let's try out a few simple cases with this test query:
/------------- Copy From Below this line ----------------------\
select dbo.udf_Txt_CharIndexRev('fdf', 'f123 asdasfdfdfddfjas ')
as [Middle]
, dbo.udf_Txt_CharIndexRev('C', 'C:\temp\ab.txt') as [start]
, dbo.udf_Txt_CharIndexRev('X', '123456789X') as [end]
, dbo.udf_Txt_CharIndexRev('AB', '12347') as [missing]
GO
\------------ Stop copying above this line --------------------/
(Results)
Middle start end missing
----------- ----------- ----------- -----------
13 1 10 0
(End of results)
As you can see from the first expression udf_Txt_CharIndexRev
gives the position of the string that we're looking for from the
beginning of string that's being searched.
This gives us a useful function. I've used it most often
for tasks like parsing the file name from of a complete path. I'm
sure you'll find other uses for it.
udf_Txt_CharIndexRev is simple enough that it's possible to
replace it with an expression. This gives us an opportunity
to explore one of the sections of the documentation block
and to do some experimentation into the performance of UDFs.
The section in the documentation block labeled "Equivalent
Template" tells how to achieve the same result as the UDF without
using the UDF. Yea, this UDF tells you how to get by without
using it. It's done in the name of efficiency.
However useful a user-defined function is, from a performance
point of view, using the UDFs is often the equivalent of using
cursors. They're great when you need them. Sometimes they're
essential. But they can be a performance headache.
My philosophy about writing efficient code is:
Writing code is an economic activity with tradeoffs between
the cost of writing code and the cost of running it. The
best course of action is to write good, easy to maintain
code and use a basic concern for performance to eliminate
any obvious performance problem. But don't try and optimize
everything. Then during testing and as needed, refine any
code that has become a performance issue. While this
strategy permits some less than optimal code into a system,
it helps your system get finished.
Without such an approach, very little gets completed. And my
job is always to deliver code that can be used in production.
If a project dangles near completion forever I haven't
accomplished my goal.
Using UDFs is great! Most of the time. But there are times when
the extra overhead of the UDF execution method become too slow.
Eliminating the UDF, where possible, is a strategy that usually
produces dramatically faster query execution times. By
"dramatically" I mean 2 times to 500 times faster. Enough to make
a real difference in the response time of your application.
The "Equivalent Template" section of the documentation block is used
to replace a function call to udf_Txt_CharIndexRev with the text
in the template and then using SQL Query Analyzer's menu item
Edit\Replace Template Parameters... to put the function
parameters into the revised SQL at the locations where they
belong. I'm using the Query Analyzer template facility because
the parameters have to be inserted in several places.
To test this out we're going to construct a table,
ExampleNumberString, with a large number of rows. Here's the
table structure:
CREATE TABLE ExampleNumberString (
ID int identity (1,1)
, BigNum Numeric (38, 0)
, NumberString varchar(128) NULL
)
Don't bother creating in on your system yet. There is a
stored procedure to create and populate it a few paragraphs
below.
Let's construct a query that uses udf_Txt_CharIndexRev to
search for the string '83' from the back of the NumberString
column and then replace the UDF with the equivalent expression.
First, here's the query:
SELECT dbo.udf_Txt_CharIndexRev ('83', NumberString)
as [Index of 83 From the string end]
FROM ExampleNumberString
Next, take a look at the "Equivalent Template":
CASE
WHEN CHARINDEX(
, ) > 0
THEN LEN()
- CHARINDEX(REVERSE()
, REVERSE ()) + 1
ELSE 0 END
After using the Query Analyzer's menu item
Edit\Replace Template Parameters... The template parameters
are replaced with the expressions supplied to the "Request
Template Parameters" dialog box.
Figure 1 has the "Replace Template Parameters" dialog box.

After replacing the template parameters we have this revised
query:
SELECT CASE WHEN CHARINDEX('83', NumberString) > 0
THEN LEN(NumberString)
- CHARINDEX(REVERSE('83')
, REVERSE (NumberString)) + 1
ELSE 0 END as [Index of 83 From the string end]
FROM ExampleNumberString
The results of the two queries are the same. But in the
second one the UDF has been eliminated.
Is the query without the UDF really faster? It might be.
I think it is. But how do we know?
Experimentation will tell. And to conduct a proper experiment
we're going to need more then the 100 row table that you'll find
in the Northwind or pubs database. The next stored procedure,
usp_CreateExampleNumberString, creates the ExampleNumberString
table and populates it. The NumberString column is a 20 to 23
digit numeric string that we can use for searching. Here's the
CREATE PROC script:
/------------- Copy From Below this line ----------------------\
CREATE PROC usp_CreateExampleNumberString
@Loops int = 20 -- creates POWER (2, @Loops) Rows
-- 20 Creates 1,048,576 rows
AS
DECLARE @LC int -- Loop counter
-- delete an existing ExampleNumberString table
if exists (select * from dbo.sysobjects
where id = object_id(N'dbo.ExampleNumberString')
and OBJECTPROPERTY(id, N'IsUserTable') = 1)
DROP TABLE dbo.ExampleNumberString
CREATE TABLE ExampleNumberString (
ID int identity (1,1)
, BigNum Numeric (38, 0)
, NumberString varchar(128) NULL
)
INSERT INTO ExampleNumberString (BigNum, NumberString)
VALUES( CONVERT (numeric(38,0), rand() * 9999999999999)
, ' ') -- preallocate
SELECT @LC = 0
WHILE @LC < @Loops BEGIN
INSERT INTO ExampleNumberString (BigNum, NumberString)
SELECT BigNum * RAND(@LC + 1) * 2
, ' '
FROM ExampleNumberString
SELECT @LC = @LC + 1
END -- WHILE
UPDATE ExampleNumberString
SET NumberString = convert(varchar(128)
, convert(numeric(38,0), 9834311) * bignum)
\------------ Stop copying above this line --------------------/
Although I pulled usp_CreateNumberString together pretty quickly
it does a reasonable job of producing a different large
NumberString for each row. I found about one percent duplication
of numbers, which I find acceptable for most testing scenarios.
Creating a meg of rows took about 40 seconds on my desktop
testing machine.
Once you've created the procedure, run it to create as many
rows as you like. The parameter to usp_CreateExampleNumberString
is use as the exponent of 2 to calculate the number of rows
desired. So 3 creates 8 rows and 20 creates 1,048,576 rows.
I suggest that you use at least 8192 rows (@Loops = 13) to get
meaningful results.
The next script uses usp_CreateExampleNumberString to create and
populate the ExampleNumberString table. Twenty is the default
and that creates the million+ rows that we want:
/------------- Copy From Below this line ----------------------\
exec usp_CreateExampleNumberString default -- for a million+ rows
select count(*) from ExampleNumberString -- verify the count
select top 10 * from ExampleNumberString -- check out a few rows
go
\------------ Stop copying above this line --------------------/
Since we're going to run two queries against the NumberString
column to test the time it takes to use a UDF on the million rows,
we should try and make the circumstances for the two queries
as close as possible. One way to do that is to force all the
rows that we'll query into memory and keep them there.
If we don't have all the desired rows in memory then we might
be comparing one query that reads a million rows from disk with
another query that reads the same million rows from SQL Server's
page cache.
SQL Server provides the DBCC PINTABLE to force all pages from
a table to remain in the cache once they've been read the first
time. Use it with caution. It can fill SQL Server's cache and
cause the query engine to lock up to the point where you have to
shut it down and restart it. So only do this if you have
adequate RAM. There's no point using virtual memory as a
substitute RAM. That just substitutes one form of disk I/O
(paging) for the one we're trying to eliminate (reading pages
from disk).
This next script pins ExampleNumberString in memory. On my
test system with a million row table it forced SQL Server to
consume 82 megabytes of RAM. The test system has 512 megabytes
of RAM and I can use the task manager to see that memory isn't
full. If your system has less available RAM, reduce the number
of rows in the test to eliminate paging.
/------------- Copy From Below this line ----------------------\
-- Pin ExampleNumberString into memory
-- Be sure that you have enough memory available before you do
-- this. My SQL Server Process grew to 82 Megabytes when I ran
-- this script.
DECLARE @db_id int, @tbl_id int
SET @db_id = DB_ID('TSQLWorking')
SET @tbl_id = OBJECT_ID('ExampleNumberString')
DBCC PINTABLE (@db_id, @tbl_id)
GO
-- Now read all the rows to force the pages into the cache
SELECT * from ExampleNumberString
GO
\------------ Stop copying above this line --------------------/
The stage is set for comparing the two queries that were
created above and a new query, Query #0, which just reads the
data and does a very simple calculation (the LEN function) on it.
Before we run the queries, I want to throw in one more wrinkle.
Any query that returns a million rows to SQL Query Analyzer's
results window is going do a lot of work on sending, receiving,
and displaying the results. To eliminate most of that work,
I've used the MAX function to just return biggest result. Our
three queries with their results are:
Query #0:
/------------- Copy From Below this line ----------------------\
SET STATISTICS TIME ON
SELECT MAX( LEN(NumberString )) as [Max Len]
FROM ExampleNumberString
GO
\------------ Stop copying above this line --------------------/
(Results)
SQL Server parse and compile time:
CPU time = 2 ms, elapsed time = 2 ms.Max Length
-----------
23
SQL Server Execution Times:
CPU time = 991 ms, elapsed time = 991 ms.
(End of Results for Query #0)
Query #1:
/------------- Copy From Below this line ----------------------\
SET STATISTICS TIME ON
SELECT MAX (dbo.udf_Txt_CharIndexRev ('83', NumberString))
as [Right Most Position]
FROM ExampleNumberString
GO
\------------ Stop copying above this line --------------------/
(Results)
SQL Server parse and compile time:
CPU time = 2 ms, elapsed time = 2 ms.
Right Most Position
-------------------
23
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(End of Results - This actually took 4 minutes 38 seconds)
( - CPU was working at 100% during that time)
Query #2:
/------------- Copy From Below this line ----------------------\
SET STATISTICS TIME ON
SELECT MAX (CASE WHEN CHARINDEX('83', NumberString) > 0
THEN LEN(NumberString) - CHARINDEX(REVERSE('83')
, REVERSE (NumberString)) + 1
ELSE 0 END
) [Right Most Position]
FROM ExampleNumberString
GO
\------------ Stop copying above this line --------------------/
(Results)
SQL Server parse and compile time:
CPU time = 3 ms, elapsed time = 3 ms.
Right Most Position
-------------------
23
SQL Server Execution Times:
CPU time = 2934 ms, elapsed time = 2954 ms.
(End of Results)
Query #1, the UDF, didn't report it's execution times correctly.
It seems that "SET STATISTICS TIME" is limited in the duration
that it can measure. I used the task manager to watch what was
going on and the CPU time for Query #1 is very close to the
elapsed time. For our comparison we'll have to use the elapsed
time.
Here's the comparison:
Query Description Time (milliseconds) Net
----- --------------------------- -------------------- ---------
#0 simple expression 991 N/A
#1 UDF 278000 277009
#2 UDF replaced by expression 2954 1963
The Net column subtracts the time it took to run the very
simplest expression on the same set of strings that are used in
Queries #1 and #2. I think that the Net column has the numbers
that should be compared because it isolates just the effect of
running the UDF or the replacement expression.
The difference in time is dramatic. The UDF takes about 140
times as much time to run as an equivalent expression. Wow!
That's the kind of difference the users would really notice.
Of course, the difference isn't going to be perceivable when
the query is run on 1, 5 or even 100 rows. When the number
of rows grows into the thousands the difference begins to
be noticeable.
The lesson that I draw from this experiment is that UDFs must
used with care in performance sensitive situations. They're
a great tool for simplifying code and in promoting code reuse.
But they can have a dramatic negative effect on performance.
+--------------------------------------------------------------+
Do you have a great UDF that you'd like to share? Or maybe you
have a T-SQL problem that you think could be solved by a UDF
but you don't know how? Send them to:
UDF@NovickSoftware.com
and they might be published in this newsletter. I try and
respond to every request that I get.
Thanks,
Andrew Novick
+--------------------------------------------------------------+
|
|