Do varchar(max), nvarchar(max) and varbinary(max) columns affect select queries?


Consider this table:



create table Books
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100) not null,
[Text] nvarchar(max) not null
)


Let's imagine that we have over 100,000 books in this table.



Now we're given data for 10,000 books to insert into this table, some of which are duplicates. So we need to filter out the duplicates first, and then insert only the new books.



One way to check for duplicates is:



select UniqueToken
from Books
where UniqueToken in
(
'first unique token',
'second unique token'
-- 10,000 items here
)


Does the existence of the Text column affect this query's performance? If so, how can we optimize it?



P.S. I have the same structure for some other data, and it's not performing well. A friend told me that I should break my table into two tables, as follows:



create table BookUniqueTokens 
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100)
)

create table Books
(
Id bigint not null primary key,
[Text] nvarchar(max)
)


Then I would run my duplicate-finding algorithm against the first table only, and insert data into both. He claimed that performance gets much better this way, because the tables are physically separate, and that the [Text] column affects any select query on the UniqueToken column.










sql-server performance t-sql datatypes

asked Mar 24 at 11:48 by Saeed Neamati; edited Mar 29 at 17:02 by Solomon Rutzky

  • Is there a nonclustered index on UniqueToken? Also, I would not advise an IN with 10k items; I would store them in a temp table and filter the UniqueTokens with this temporary table. More on that here

    – Randi Vertongen
    Mar 24 at 11:56

  • 1) If you are checking for duplicates, why would you include the Text column in the query? 2) Can you please update the question to include a few examples of values stored in the UniqueToken column? If you don't want to share actual company data, modify it, but keep the format the same.

    – Solomon Rutzky
    Mar 24 at 16:04

  • @RandiVertongen, yes there is a nonclustered index on UniqueToken.

    – Saeed Neamati
    Mar 24 at 16:50

  • @SolomonRutzky, I'm retrieving existing values from the database, to be excluded inside the application code.

    – Saeed Neamati
    Mar 24 at 16:51

  • @SaeedNeamati I added an edit based on the NC index existing. If the query in the question is the one that needs to be optimized, and the NC index does not have the Text column included, then I would look at the IN for query optimization. There are better ways to find duplicate data.

    – Randi Vertongen
    Mar 24 at 18:16

















2 Answers
Examples



Consider your query with 8 filter predicates in your IN clause on a dataset of 10K records.



select UniqueToken
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);


A clustered index scan is used; there are no other indexes present on this test table:



[screenshot: execution plan with clustered index scan]



With a data size of 216 bytes.



You should also note how, even with only 8 records, the OR filters stack up.



The reads that happened on this table:



[screenshot: STATISTICS IO output]



Credits to statisticsparser.
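
For reference, a minimal sketch to reproduce read counts like these yourself: SET STATISTICS IO is a standard session option, and its Messages output is what statisticsparser formats.

-- Sketch: capture per-table logical/LOB reads for the test query.
SET STATISTICS IO ON;

select UniqueToken
from Books
where UniqueToken in ('Unique token 1', 'Unique token 2' /* , ... */);

SET STATISTICS IO OFF;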



When you include the Text column in the select part of your query, the actual data size changes drastically:



select UniqueToken, Text
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);


Again, a clustered index scan, with a residual predicate:



[screenshot: execution plan with residual predicate]



But with a data size of 32 KB.



And there are almost 1,000 LOB logical reads:



[screenshot: STATISTICS IO output with LOB reads]



Now we create the two tables in question and fill them with the same 10k records, then execute the same select without Text. Remember that we had 99 logical reads when using the Books table.



select UniqueToken
from BookUniqueTokens
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
)


The reads on BookUniqueTokens are lower, 67 instead of 99.



[screenshot: STATISTICS IO output]



We can track that back to the pages in the original Books table and the pages in the new table without the Text column.



Original Books table:



[screenshot: page allocation for Books]



New BookUniqueTokens table:



[screenshot: page allocation for BookUniqueTokens]



So, all the pages + (2 overhead pages?) are read from the clustered index.



Why is there a difference, and why is the difference not bigger? After all, the data size difference is huge (LOB data vs. no LOB data).



Books Data space



[screenshot: Books data space]



BooksWithText Data space



[screenshot: BooksWithText data space]



The reason for this is ROW_OVERFLOW_DATA.



When a value grows beyond 8 KB, it is stored as ROW_OVERFLOW_DATA on separate pages.



OK, if LOB data is stored on different pages, why are the page counts of these two clustered indexes not the same?



Because of the 24-byte pointer added to the clustered index to track each of these pages.
After all, SQL Server needs to know where it can find the LOB data.



Source
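
As a side note, you can see this split directly by querying the allocation units. A minimal sketch (assuming the dbo.Books table from the question); IN_ROW_DATA covers the regular pages, while LOB_DATA and ROW_OVERFLOW_DATA cover the off-row pages:

-- Sketch: page counts per allocation unit type for dbo.Books.
-- container_id maps to hobt_id for in-row data; for ordinary tables
-- hobt_id equals partition_id, so this simple join is good enough here.
SELECT au.type_desc, au.total_pages, au.used_pages, au.data_pages
FROM sys.allocation_units AS au
INNER JOIN sys.partitions AS p
    ON au.container_id = p.partition_id
WHERE p.object_id = OBJECT_ID(N'dbo.Books');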




To answer your questions




He claimed that [Text] column affects any select query on the
UniqueToken column.




And




Does the existence of the Text column affect this query's performance? If
so, how can we optimize it?




If the data stored is actually LOB data, and the query provided in the question is used, then it does bring some overhead due to the 24-byte pointers.



Provided the executions per minute are not crazy high, I would say that this is negligible, even with 100K records.



Remember that this overhead only happens if an index that includes Text is used, such as the clustered index.



But what if the clustered index scan is used, and the LOB data does not exceed 8 KB?



If the data does not exceed 8 KB and you have no index on UniqueToken, the overhead could be bigger, even when not selecting the Text column.



Logical reads on 10k records when Text is only 137 characters long (all records):




Table 'Books2'. Scan count 1, logical reads 419




Due to all this extra data being on the clustered index pages.



Again, an index on UniqueToken (without including the Text column) will resolve this issue.



As pointed out by @David Browne - Microsoft, you could also store the data off-row, so as not to add this overhead on the clustered index when not selecting the Text column.




Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases




TL;DR



Based on the query given, indexing UniqueToken without including [Text] should resolve your troubles.
Additionally, I would use a temporary table or table type to do the filtering instead of the IN statement.



EDIT:




yes there is a nonclustered index on UniqueToken




Your example query is not touching the Text column, and based on the query this should be a covering index.



If we test this on the three tables we used previously (UniqueToken + LOB data, solely UniqueToken, and UniqueToken + 137 characters of data in an nvarchar(max) column):



CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);


The reads remain the same for these three tables, because the nonclustered index is used.



Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Additional details



by @David Browne - Microsoft




Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases/




Remember that you have to rebuild your indexes for this to take effect on already populated data.
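
For completeness, a sketch of that option change (sp_tableoption and ALTER INDEX ... REBUILD are the documented commands; dbo.Books is assumed):

-- Sketch: push (n)varchar(max) values off-row for dbo.Books.
EXEC sys.sp_tableoption
    @TableNamePattern = N'dbo.Books',
    @OptionName = 'large value types out of row',
    @OptionValue = 1;

-- Per the note above: rebuild so already populated data is affected.
ALTER INDEX ALL ON dbo.Books REBUILD;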



By @Erik Darling



On



  • MAX Data Types Do WHAT?

    Filtering on LOB data sucks.



  • Memory Grants and Data Size

    Your memory grants might go through the roof when using bigger data types, impacting performance.






answered Mar 24 at 12:40 by Randi Vertongen; edited Mar 24 at 18:46

    Technically speaking, anything that takes up more space on a data page such that the data requires more data pages will decrease performance, even if by such a small amount that it cannot be easily measured. But more data pages means more operations required to read more pages, and more memory required to hold more data pages, and so on.



    So, if you are scanning a heap or index, then the presence of an NVARCHAR(MAX) column can impact performance, even if not selecting it. For example, if you have 5000 - 7000 bytes per row, then in the schema shown in the question, that would be stored in-row, creating the need for more data pages. Yet 8100 bytes (roughly) or more will guarantee that the data is stored off-row with just a pointer to the LOB page(s), so that wouldn't be so bad.



    But, in your case, since you mentioned having a non-clustered index on UniqueToken, then it really shouldn't matter nearly as much (or even at all), if there is an NVARCHAR(MAX) column with 5000-7000 bytes (causing 1 page per row) because the query should be looking at the index which only has the Id and UniqueToken columns in it. And, the operation should be doing a seek instead of a scan, so not reading all data pages in the index.



    Final consideration: unless you have really old hardware (meaning, no RAM and/or other processes hogging disk/CPU/RAM, in which case most queries would be affected, not just this one), then 100,000 rows is not a lot of rows. In fact, it's not even close to a lot of rows. 1 million rows wouldn't even be a lot of rows to make a huge difference here.



    So, assuming that your query is indeed using the nonclustered index, then I think we should look somewhere besides the NVARCHAR(MAX) column for the issue. This is not to say that sometimes splitting a table into two tables isn't the best choice, it's just doubtful that it would help here given the info provided.



    The three places I would look at for improvement are:



    1. Explicit schema names: This is minor, but still, always prefix schema-based objects with their schema name. So you should be using dbo.Books instead of just Books. Not only will this help in cases where multiple schemas are being used and different users have different default schemas, but it also reduces some locking that happens when the schema is not explicitly stated and SQL Server needs to check a few places for it.



    2. The IN list: These are convenient, but not known for scalability. IN lists expand to an OR condition for each item in the list. Meaning:



      where UniqueToken in 
      (
      'first unique token',
      'second unique token'
      -- 10,000 items here
      )


      becomes:



      where UniqueToken = 'first unique token'
      OR UniqueToken = 'second unique token'
      -- 10,000 items here (9,998 more OR conditions)


      As you add more items to the list, you get more OR conditions.



      Instead of building an IN list dynamically, create a local temporary table and create the list of INSERT statements. Also, wrap all of them in a transaction to avoid the transaction overhead that would otherwise occur per each INSERT (hence reducing 10,000 transactions down to 1):



      CREATE TABLE #UniqueTokens
      (
      UniqueToken VARCHAR(100) COLLATE Latin1_General_100_BIN2 NOT NULL
      PRIMARY KEY
      );

      BEGIN TRAN;

      -- dynamically generated:
      -- INSERT INTO #UniqueTokens (UniqueToken) VALUES ('...');

      COMMIT TRAN;


      Now that you have that list loaded, you can use it as follows to get the same set of duplicate tokens:



      SELECT bk.[UniqueToken]
      FROM dbo.Books bk
      INNER JOIN #UniqueTokens tmp
      ON tmp.[UniqueToken] = bk.[UniqueToken];


      Or, given that you want to know which of the 10,000 new entries you can load, you really want the list of non-duplicate tokens so that you can insert those, right? In which case you would do the following:



      SELECT tmp.[UniqueToken]
      FROM #UniqueTokens tmp
      WHERE NOT EXISTS (SELECT *
                        FROM dbo.Books bk
                        WHERE bk.[UniqueToken] = tmp.[UniqueToken]);



    3. String comparison: If there is no specific need for case-insensitive and/or accent-insensitive comparisons with regards to UniqueToken, and assuming that the database in which you created this table (which does not use the COLLATE clause for the [UniqueToken] column) does not have a binary default collation, then you can improve the performance of matching UniqueToken values by using a binary comparison. Non-binary comparisons need to create a sort key for each value, and that sort key is based on linguistic rules for a specific culture (i.e. Latin1_General, French, Hebrew, Syriac, etc.). That is a lot of extra processing if the values simply need to be exactly the same. So do the following (a sketch follows this list):



      1. Drop the nonclustered index on UniqueToken

      2. Alter the UniqueToken column to be VARCHAR(100) NOT NULL COLLATE Latin1_General_100_BIN2 (just like in the temp table shown above)

      3. Recreate the nonclustered index on UniqueToken
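
      A sketch of those three steps (the index name IX_Books_UniqueToken is assumed; use whatever your existing index is actually called):

      DROP INDEX [IX_Books_UniqueToken] ON dbo.Books;

      ALTER TABLE dbo.Books
          ALTER COLUMN UniqueToken VARCHAR(100)
              COLLATE Latin1_General_100_BIN2 NOT NULL;

      CREATE NONCLUSTERED INDEX [IX_Books_UniqueToken]
          ON dbo.Books (UniqueToken);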






    share|improve this answer























      Your Answer








      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "182"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f232941%2fdo-varcharmax-nvarcharmax-and-varbinarymax-columns-affect-select-queries%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      6














      Examples



      Consider your query with 8 filter predicates in your IN clause on a dataset of 10K records.



      select UniqueToken
      from Books
      where UniqueToken in
      (
      'Unique token 1',
      'Unique token 2',
      'Unique token 3',
      'Unique token 4',
      'Unique token 5',
      'Unique token 6',
      'Unique token 9999',
      'Unique token 5000'
      -- 10,000 items here
      );


      A clustered index scan is used, there are no other indexes present on this test table



      enter image description here



      With a data size of 216 Bytes.



      You should also note how even with 8 records, the OR filters are stacking up.



      The reads that happened on this table:



      enter image description here



      Credits to statisticsparser.



      When you include the Text column in the select part of your query, the actual data size changes drastically:



      select UniqueToken,Text
      from Books
      where UniqueToken in
      (
      'Unique token 1',
      'Unique token 2',
      'Unique token 3',
      'Unique token 4',
      'Unique token 5',
      'Unique token 6',
      'Unique token 9999',
      'Unique token 5000'
      -- 10,000 items here
      );


      Again, the Clustered index Scan with a residual predicate:



      enter image description here



      But with a dataset of 32KB.



      As there are almost 1000 lob logical reads:



      enter image description here



      Now, when we create the two tables in question, and fill them up with the same 10k records



      Executing the same select without Text. Remember that we had 99 Logical reads when using the Books Table.



      select UniqueToken
      from BookUniqueTokens
      where UniqueToken in
      (
      'Unique token 1',
      'Unique token 2',
      'Unique token 3',
      'Unique token 4',
      'Unique token 5',
      'Unique token 6',
      'Unique token 9999',
      'Unique token 5000'
      -- 10,000 items here
      )


      The reads on BookUniqueTokens are lower, 67 instead of 99.



      enter image description here



      We can track that back to the pages in the original Books table and the pages in the new table without the Text.



      Original Books table:



      enter image description here



      New BookUniqueTokens table



      enter image description here



      So, all the pages + (2 overhead pages?) are read from the clustered index.



      Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)



      Books Data space



      enter image description here



      BooksWithText Data space



      enter image description here



      The reason for this is ROW_OVERFLOW_DATA.



      When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.



      Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?



      Due to the 24 byte pointer added to the Clustered index to track each of these pages.
      After all, sql server needs to know where it can find the lob data.



      Source




      To answer your questions




      He claimed that [Text] column affects any select query on the
      UniqueToken column.




      And




      Does the existence of Text column affect this query's performance? If
      so, how can we optimized it?




      If the data stored is actually Lob Data, and the Query provided in the answer is used



      It does bring some overhead due to the 24 byte pointers.



      Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.



      Remember that this overhead only happens if an index that includes Text is used, such as the clustered index.



      But, what if the clustered index scan is used, and the lob data does not exceed 8kb?



      If the data does not exceed 8kb, and you have no index on UniqueToken,the overhead could be bigger . even when not selecting the Text column.



      Logical reads on 10k records when Text is only 137 characters long (all records):




      Table 'Books2'. Scan count 1, logical reads 419




      Due to all this extra data being on the clustered index pages.



      Again, an index on UniqueToken (Without including the Text column) will resolve this issue.



      As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.




      Also, if you do want the text stored off-row, you can force that
      without using a separate table. Just set the 'large value types out of
      row' option with sp_tableoption.
      docs.microsoft.com/en-us/sql/relational-databases




      TL;DR



      Based on the query given, indexing UniqueToken without including TEXT should resolve your troubles.
      Additionally, I would use a temporary table or table type to do the filtering instead of the IN statement.



      EDIT:




      yes there is a nonclustered index on UniqueToken




      Your example query is not touching the Text column, and based on the query this should be a covering index.



      If we test this on the three tables we previously used (UniqueToken + Lob data, Solely UniqueToken, UniqueToken + 137 Char data in nvarchar(max) column)



      CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
      CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
      CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);


      The reads remain the same for these three tables, because the nonclustered index is used.



      Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


      Additional details



      by @David Browne - Microsoft




      Also, if you do want the text stored off-row, you can force that
      without using a separate table. Just set the 'large value types out of
      row' option with sp_tableoption.
      docs.microsoft.com/en-us/sql/relational-databases/




      Remember that you have to rebuild your indexes for this to take effect on already populated data.



      By @Erik Darling



      On



      • MAX Data Types Do WHAT?

      Filtering on Lob data sucks.



      • Memory Grants and Data Size

      Your memory grants might go through the roof when using bigger datatypes, impacting performance.






      share|improve this answer





























        6














        Examples



        Consider your query with 8 filter predicates in your IN clause on a dataset of 10K records.



        select UniqueToken
        from Books
        where UniqueToken in
        (
        'Unique token 1',
        'Unique token 2',
        'Unique token 3',
        'Unique token 4',
        'Unique token 5',
        'Unique token 6',
        'Unique token 9999',
        'Unique token 5000'
        -- 10,000 items here
        );


        A clustered index scan is used, there are no other indexes present on this test table



        enter image description here



        With a data size of 216 Bytes.



        You should also note how even with 8 records, the OR filters are stacking up.



        The reads that happened on this table:



        enter image description here



        Credits to statisticsparser.



        When you include the Text column in the select part of your query, the actual data size changes drastically:



        select UniqueToken,Text
        from Books
        where UniqueToken in
        (
        'Unique token 1',
        'Unique token 2',
        'Unique token 3',
        'Unique token 4',
        'Unique token 5',
        'Unique token 6',
        'Unique token 9999',
        'Unique token 5000'
        -- 10,000 items here
        );


        Again, the Clustered index Scan with a residual predicate:



        enter image description here



        But with a dataset of 32KB.



        As there are almost 1000 lob logical reads:



        enter image description here



        Now, when we create the two tables in question, and fill them up with the same 10k records



        Executing the same select without Text. Remember that we had 99 Logical reads when using the Books Table.



        select UniqueToken
        from BookUniqueTokens
        where UniqueToken in
        (
        'Unique token 1',
        'Unique token 2',
        'Unique token 3',
        'Unique token 4',
        'Unique token 5',
        'Unique token 6',
        'Unique token 9999',
        'Unique token 5000'
        -- 10,000 items here
        )


        The reads on BookUniqueTokens are lower, 67 instead of 99.



        enter image description here



        We can track that back to the pages in the original Books table and the pages in the new table without the Text.



        Original Books table:



        enter image description here



        New BookUniqueTokens table



        enter image description here



        So, all the pages + (2 overhead pages?) are read from the clustered index.



        Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)



        Books Data space



        enter image description here



        BooksWithText Data space



        enter image description here



        The reason for this is ROW_OVERFLOW_DATA.



        When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.



        Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?



        Due to the 24 byte pointer added to the Clustered index to track each of these pages.
        After all, sql server needs to know where it can find the lob data.



        Source




        To answer your questions




        He claimed that [Text] column affects any select query on the
        UniqueToken column.




        And




        Does the existence of Text column affect this query's performance? If
        so, how can we optimized it?




        If the data stored is actually Lob Data, and the Query provided in the answer is used



        It does bring some overhead due to the 24 byte pointers.



        Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.



        Remember that this overhead only happens if an index that includes Text is used, such as the clustered index.



        But, what if the clustered index scan is used, and the lob data does not exceed 8kb?



        If the data does not exceed 8kb, and you have no index on UniqueToken,the overhead could be bigger . even when not selecting the Text column.



        Logical reads on 10k records when Text is only 137 characters long (all records):




        Table 'Books2'. Scan count 1, logical reads 419




        Due to all this extra data being on the clustered index pages.



        Again, an index on UniqueToken (Without including the Text column) will resolve this issue.



        As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.




        Also, if you do want the text stored off-row, you can force that
        without using a separate table. Just set the 'large value types out of
        row' option with sp_tableoption.
        docs.microsoft.com/en-us/sql/relational-databases




        TL;DR



        Based on the query given, indexing UniqueToken without including TEXT should resolve your troubles.
        Additionally, I would use a temporary table or table type to do the filtering instead of the IN statement.



        EDIT:




        yes there is a nonclustered index on UniqueToken




        Your example query is not touching the Text column, and based on the query this should be a covering index.



        If we test this on the three tables we previously used (UniqueToken + Lob data, Solely UniqueToken, UniqueToken + 137 Char data in nvarchar(max) column)



        CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
        CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
        CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);


        The reads remain the same for these three tables, because the nonclustered index is used.



        Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

        Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

        Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


        Additional details



        by @David Browne - Microsoft




        Also, if you do want the text stored off-row, you can force that
        without using a separate table. Just set the 'large value types out of
        row' option with sp_tableoption.
        docs.microsoft.com/en-us/sql/relational-databases/




        Remember that you have to rebuild your indexes for this to take effect on already populated data.



        By @Erik Darling



        On



        • MAX Data Types Do WHAT?

        Filtering on Lob data sucks.



        • Memory Grants and Data Size

        Your memory grants might go through the roof when using bigger datatypes, impacting performance.






        share|improve this answer



























          6












          6








          6







          Examples



          Consider your query with 8 filter predicates in your IN clause on a dataset of 10K records.



          select UniqueToken
          from Books
          where UniqueToken in
          (
          'Unique token 1',
          'Unique token 2',
          'Unique token 3',
          'Unique token 4',
          'Unique token 5',
          'Unique token 6',
          'Unique token 9999',
          'Unique token 5000'
          -- 10,000 items here
          );


          A clustered index scan is used, there are no other indexes present on this test table



          enter image description here



          With a data size of 216 Bytes.



          You should also note how even with 8 records, the OR filters are stacking up.



          The reads that happened on this table:



          enter image description here



          Credits to statisticsparser.



          When you include the Text column in the select part of your query, the actual data size changes drastically:



          select UniqueToken,Text
          from Books
          where UniqueToken in
          (
          'Unique token 1',
          'Unique token 2',
          'Unique token 3',
          'Unique token 4',
          'Unique token 5',
          'Unique token 6',
          'Unique token 9999',
          'Unique token 5000'
          -- 10,000 items here
          );


          Again, the Clustered index Scan with a residual predicate:



          enter image description here



          But with a dataset of 32KB.



          As there are almost 1000 lob logical reads:



          enter image description here



          Now, when we create the two tables in question, and fill them up with the same 10k records



          Executing the same select without Text. Remember that we had 99 Logical reads when using the Books Table.



          select UniqueToken
          from BookUniqueTokens
          where UniqueToken in
          (
          'Unique token 1',
          'Unique token 2',
          'Unique token 3',
          'Unique token 4',
          'Unique token 5',
          'Unique token 6',
          'Unique token 9999',
          'Unique token 5000'
          -- 10,000 items here
          )


          The reads on BookUniqueTokens are lower, 67 instead of 99.



          enter image description here



          We can track that back to the pages in the original Books table and the pages in the new table without the Text.



          Original Books table:



          enter image description here



          New BookUniqueTokens table



          enter image description here



          So, all the pages + (2 overhead pages?) are read from the clustered index.



          Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)



          Books Data space



          enter image description here



          BooksWithText Data space



          enter image description here



          The reason for this is ROW_OVERFLOW_DATA.



          When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.



          Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?



          Due to the 24 byte pointer added to the Clustered index to track each of these pages.
          After all, sql server needs to know where it can find the lob data.



          Source




          To answer your questions




          He claimed that [Text] column affects any select query on the
          UniqueToken column.




          And




          Does the existence of Text column affect this query's performance? If
          so, how can we optimized it?




          If the data stored is actually Lob Data, and the Query provided in the answer is used



          It does bring some overhead due to the 24 byte pointers.



          Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.



          Remember that this overhead only happens if an index that includes Text is used, such as the clustered index.



          But, what if the clustered index scan is used, and the lob data does not exceed 8kb?



          If the data does not exceed 8kb, and you have no index on UniqueToken,the overhead could be bigger . even when not selecting the Text column.



          Logical reads on 10k records when Text is only 137 characters long (all records):




          Table 'Books2'. Scan count 1, logical reads 419




          Due to all this extra data being on the clustered index pages.



          Again, an index on UniqueToken (Without including the Text column) will resolve this issue.



          As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.




          Also, if you do want the text stored off-row, you can force that
          without using a separate table. Just set the 'large value types out of
          row' option with sp_tableoption.
          docs.microsoft.com/en-us/sql/relational-databases




          TL;DR



          Based on the query given, indexing UniqueToken without including TEXT should resolve your troubles.
          Additionally, I would use a temporary table or table type to do the filtering instead of the IN statement.



          EDIT:




          yes there is a nonclustered index on UniqueToken




          Your example query is not touching the Text column, and based on the query this should be a covering index.



          If we test this on the three tables we previously used (UniqueToken + Lob data, Solely UniqueToken, UniqueToken + 137 Char data in nvarchar(max) column)



          CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
          CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
          CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);


          The reads remain the same for these three tables, because the nonclustered index is used.



          Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

          Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

          Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


          Additional details



          by @David Browne - Microsoft




          Also, if you do want the text stored off-row, you can force that
          without using a separate table. Just set the 'large value types out of
          row' option with sp_tableoption.
          docs.microsoft.com/en-us/sql/relational-databases/




          Remember that you have to rebuild your indexes for this to take effect on already populated data.



          By @Erik Darling



          On



          • MAX Data Types Do WHAT?

          Filtering on Lob data sucks.



          • Memory Grants and Data Size

          Your memory grants might go through the roof when using bigger datatypes, impacting performance.






          share|improve this answer















          Examples



          Consider your query with 8 filter predicates in your IN clause on a dataset of 10K records.



          select UniqueToken
          from Books
          where UniqueToken in
          (
          'Unique token 1',
          'Unique token 2',
          'Unique token 3',
          'Unique token 4',
          'Unique token 5',
          'Unique token 6',
          'Unique token 9999',
          'Unique token 5000'
          -- 10,000 items here
          );


          A clustered index scan is used, there are no other indexes present on this test table



          enter image description here



          With a data size of 216 Bytes.



          You should also note how even with 8 records, the OR filters are stacking up.



          The reads that happened on this table:



          enter image description here



          Credits to statisticsparser.



          When you include the Text column in the select part of your query, the actual data size changes drastically:



          select UniqueToken,Text
          from Books
          where UniqueToken in
          (
          'Unique token 1',
          'Unique token 2',
          'Unique token 3',
          'Unique token 4',
          'Unique token 5',
          'Unique token 6',
          'Unique token 9999',
          'Unique token 5000'
          -- 10,000 items here
          );


          Again, the Clustered index Scan with a residual predicate:



          enter image description here



          But with a dataset of 32KB.



          As there are almost 1000 lob logical reads:



          enter image description here



          Now, when we create the two tables in question, and fill them up with the same 10k records



          Executing the same select without Text. Remember that we had 99 Logical reads when using the Books Table.



          select UniqueToken
          from BookUniqueTokens
          where UniqueToken in
          (
          'Unique token 1',
          'Unique token 2',
          'Unique token 3',
          'Unique token 4',
          'Unique token 5',
          'Unique token 6',
          'Unique token 9999',
          'Unique token 5000'
          -- 10,000 items here
          )


          The reads on BookUniqueTokens are lower, 67 instead of 99.



          enter image description here



          We can track that back to the pages in the original Books table and the pages in the new table without the Text.



          Original Books table:



          enter image description here



          New BookUniqueTokens table



          enter image description here



          So, all the pages + (2 overhead pages?) are read from the clustered index.



          Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)



          Books Data space



          enter image description here



          BooksWithText Data space



          enter image description here



          The reason for this is ROW_OVERFLOW_DATA.



          When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.



          Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?



          Due to the 24 byte pointer added to the Clustered index to track each of these pages.
          After all, sql server needs to know where it can find the lob data.



          Source




          To answer your questions




          He claimed that [Text] column affects any select query on the
          UniqueToken column.




          And




          Does the existence of Text column affect this query's performance? If
          so, how can we optimized it?




          If the data stored is actually Lob Data, and the Query provided in the answer is used



          It does bring some overhead due to the 24 byte pointers.



          Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.



          Remember that this overhead only happens if an index that includes Text is used, such as the clustered index.



          But, what if the clustered index scan is used, and the lob data does not exceed 8kb?



          If the data does not exceed 8kb, and you have no index on UniqueToken,the overhead could be bigger . even when not selecting the Text column.



          Logical reads on 10k records when Text is only 137 characters long (all records):




          Table 'Books2'. Scan count 1, logical reads 419




          Due to all this extra data being on the clustered index pages.



          Again, an index on UniqueToken (Without including the Text column) will resolve this issue.



          As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.




          Also, if you do want the text stored off-row, you can force that
          without using a separate table. Just set the 'large value types out of
          row' option with sp_tableoption.
          docs.microsoft.com/en-us/sql/relational-databases




          TL;DR



Based on the query given, indexing UniqueToken without including the Text column should resolve your troubles.
Additionally, I would use a temporary table or a table type to do the filtering instead of the IN clause; a sketch of the table-type approach follows.
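For example (a minimal sketch; the type name dbo.UniqueTokenList is made up for illustration, and the token list would ideally be passed in from the client as a table-valued parameter):

CREATE TYPE dbo.UniqueTokenList AS TABLE
(
    UniqueToken varchar(100) NOT NULL PRIMARY KEY
);
GO

DECLARE @tokens dbo.UniqueTokenList;

-- load the incoming batch of tokens
INSERT INTO @tokens (UniqueToken)
VALUES ('Unique token 1'), ('Unique token 2'); -- ... up to 10,000 items

-- find the duplicates with a join instead of a 10,000-item IN list
SELECT b.UniqueToken
FROM dbo.Books AS b
INNER JOIN @tokens AS t
    ON t.UniqueToken = b.UniqueToken;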



          EDIT:




          yes there is a nonclustered index on UniqueToken




Your example query does not touch the Text column, and based on the query this nonclustered index should be covering.



If we test this on the three tables we used previously (UniqueToken + LOB data, UniqueToken only, and UniqueToken + 137 characters of data in an nvarchar(max) column):



          CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
          CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
          CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);


          The reads remain the same for these three tables, because the nonclustered index is used.



          Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

          Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

          Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


          Additional details



          by @David Browne - Microsoft




          Also, if you do want the text stored off-row, you can force that
          without using a separate table. Just set the 'large value types out of
          row' option with sp_tableoption.
          docs.microsoft.com/en-us/sql/relational-databases/




Remember that setting this option does not relocate data that is already populated; existing values only move off-row once they are written again, as sketched below.
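A minimal sketch of that option, assuming the dbo.Books table from the question:

-- Store (n)varchar(max) / varbinary(max) values off-row,
-- leaving only a 16-byte pointer in the row
EXEC sys.sp_tableoption
    @TableNamePattern = N'dbo.Books',
    @OptionName       = 'large value types out of row',
    @OptionValue      = 'ON';

-- Existing values are only moved off-row once they are written again:
UPDATE dbo.Books SET [Text] = [Text];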



By @Erik Darling, on:

• MAX Data Types Do WHAT? Filtering on LOB data sucks.

• Memory Grants and Data Size. Your memory grants might go through the roof when using bigger data types, impacting performance.







answered Mar 24 at 12:40 by Randi Vertongen, edited Mar 24 at 18:46
              Technically speaking, anything that takes up more space on a data page such that the data requires more data pages will decrease performance, even if by such a small amount that it cannot be easily measured. But more data pages means more operations required to read more pages, and more memory required to hold more data pages, and so on.



              So, if you are scanning a heap or index, then the presence of an NVARCHAR(MAX) column can impact performance, even if not selecting it. For example, if you have 5000 - 7000 bytes per row, then in the schema shown in the question, that would be stored in-row, creating the need for more data pages. Yet 8100 bytes (roughly) or more will guarantee that the data is stored off-row with just a pointer to the LOB page(s), so that wouldn't be so bad.



But, in your case, since you mentioned having a nonclustered index on UniqueToken, it really shouldn't matter nearly as much (or even at all) whether there is an NVARCHAR(MAX) column with 5000-7000 bytes per row (causing 1 page per row), because the query should be using that index, which only contains the Id and UniqueToken columns. And the operation should be doing a seek instead of a scan, so it is not reading all of the data pages in the index.



Final consideration: unless you have really old hardware, no RAM, or other processes hogging disk/CPU/RAM (in which case most queries would be affected, not just this one), 100,000 rows is not a lot of rows. It's not even close to a lot of rows; even 1 million rows wouldn't be enough to make a huge difference here.



              So, assuming that your query is indeed using the nonclustered index, then I think we should look somewhere besides the NVARCHAR(MAX) column for the issue. This is not to say that sometimes splitting a table into two tables isn't the best choice, it's just doubtful that it would help here given the info provided.



              The three places I would look at for improvement are:



              1. Explicit schema names: This is minor, but still, always prefix schema-based objects with their schema name. So you should be using dbo.Books instead of just Books. Not only will this help in cases where multiple schemas are being used and different users have different default schemas, but it also reduces some locking that happens when the schema is not explicitly stated and SQL Server needs to check a few places for it.



              2. The IN list: These are convenient, but not known for scalability. IN lists expand to an OR condition for each item in the list. Meaning:



                where UniqueToken in 
                (
                'first unique token',
                'second unique token'
                -- 10,000 items here
                )


                becomes:



                where UniqueToken = 'first unique token'
                OR UniqueToken = 'second unique token'
                -- 10,000 items here (9,998 more OR conditions)


                As you add more items to the list, you get more OR conditions.



Instead of building an IN list dynamically, create a local temporary table and generate the INSERT statements into it. Also, wrap all of them in a single transaction to avoid the per-statement transaction overhead that would otherwise occur (hence reducing 10,000 transactions down to 1):



CREATE TABLE #UniqueTokens
(
    UniqueToken VARCHAR(100) NOT NULL
        COLLATE Latin1_General_100_BIN2
        PRIMARY KEY
);

BEGIN TRAN;

-- dynamically generated, one statement per incoming token:
INSERT INTO #UniqueTokens (UniqueToken) VALUES ('...');
-- ...repeated for all 10,000 tokens...

COMMIT TRAN;


                Now that you have that list loaded, you can use it as follows to get the same set of duplicate tokens:



                SELECT bk.[UniqueToken]
                FROM dbo.Books bk
                INNER JOIN #UniqueTokens tmp
                ON tmp.[UniqueToken] = bk.[UniqueToken];


                Or, given that you want to know which of the 10,000 new entries you can load, you really want the list of non-duplicate tokens so that you can insert those, right? In which case you would do the following:



SELECT tmp.[UniqueToken]
FROM #UniqueTokens tmp
WHERE NOT EXISTS (SELECT *
                  FROM dbo.Books bk
                  WHERE bk.[UniqueToken] = tmp.[UniqueToken]);
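If the incoming batch also carries the Text values, the same anti-join can drive the load itself (a sketch, assuming a hypothetical staging table #NewBooks holding both columns):

INSERT INTO dbo.Books ([UniqueToken], [Text])
SELECT src.[UniqueToken], src.[Text]
FROM #NewBooks src
WHERE NOT EXISTS (SELECT *
                  FROM dbo.Books bk
                  WHERE bk.[UniqueToken] = src.[UniqueToken]);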



3. String comparison: If there is no specific need for case-insensitive and/or accent-insensitive comparisons on UniqueToken, and assuming the database in which you created this table (which does not use a COLLATE clause for the [UniqueToken] column) does not already have a binary default collation, then you can improve the performance of matching UniqueToken values by using a binary comparison. Non-binary comparisons need to create a sort key for each value, and that sort key is based on linguistic rules for a specific culture (e.g. Latin1_General, French, Hebrew, Syriac, etc.). That is a lot of extra processing if the values simply need to be exactly the same. So do the following (sketched in code after the steps):



1. Drop the nonclustered index on UniqueToken

2. Alter the UniqueToken column to be VARCHAR(100) NOT NULL COLLATE Latin1_General_100_BIN2 (just like in the temp table shown above)

3. Recreate the nonclustered index on UniqueToken
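Put together, assuming the nonclustered index is named IX_Books_UniqueToken (the name is illustrative):

DROP INDEX [IX_Books_UniqueToken] ON dbo.Books;

ALTER TABLE dbo.Books
    ALTER COLUMN UniqueToken VARCHAR(100) COLLATE Latin1_General_100_BIN2 NOT NULL;

CREATE NONCLUSTERED INDEX [IX_Books_UniqueToken] ON dbo.Books (UniqueToken);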






answered Mar 29 at 16:47 by Solomon Rutzky