Why SSIS always gets Excel data types wrong, and how to fix it!
Working as a BI Developer with the Microsoft stack of tools, I often deal with multiple sources of data, one of which being Excel, a very common data source in Data Warehousing projects. As some of you may be aware, working with Excel in SSIS (SQL Server Integration Services) can have its issues, if you're aware of these, this post will help people in more detail and provide you with some possible solutions to save you some time.
This blog will also help answer questions about:
- Why doesn’t SSIS pick up my data types correctly when using Excel as a source
- Why are some of my columns NULL when the excel file contains data
- How do I fix problems with my excel source
- Excel SSIS driver issues
The main driver for the problems we experience whilst working with SSIS and Excel originate from the fact SSIS auto-determines the data types for when connecting to the Excel source. The SSIS Excel driver determines the data type for each column based on reading values of the first 8 rows.
- If the top 8 records contain equal number of numeric and character types – then the priority is numeric (column2 in Figures 1 & 2)
- If the majority of top 8 records are numeric then it assigns the data type as numeric and all character values are read as NULLs (column3 in Figures 1 &2)
- If the majority of top 8 records are of character type then it assigns the data type as string and all numeric values are read as NULLs (column4 in Figures 1 & 2)
As you can see from the above screenshots, we receive NULLs in our destination table where we had data to load, this is not the desired output. So, how do we fix this?
By adding IMEX=1 (Import Export Mode) property to the connection string, excel reads the data correctly (see Figure 4).
Set the connection string property as below:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Folder1\Book1.xls;Extended Properties="EXCEL 8.0;HDR=YES;IMEX=1";
After adding this property, the excel driver reads the columns with mixed types as Unicode String data type (see Figure 5) and reads the data properly (see Figure 6).
Also in situations where the 1st 8 rows are NULLs in the excel file we need to edit the registry settings and set the TypeGuessRows property to 0 in order to let the excel driver to use all rows in the file to determine the data types properly (see Figure 7) by sampling all of the rows rather than the 1st 8. Range of values this property allows are from 0-16. So, we can have 1-16 rows, or all rows as permissible sample sizes.
The location of the key needing to changes is as follows:
HKEY_LOCAL_MACHINE -> SOFTWARE -> Wow6432Node -> Microsoft -> Jet -> 4.0 ->Engines -> Excel -> TypeGuessRows has the value 0. (It contains 8 by default)
I hope this helps investigating any Excel driver issues you may be experiencing. Please feel free to comment on the post with your experiences.