SQLServerBrains: Apache Hadoop

Thursday, 2 August 2012

Connect Excel to Hadoop on Azure via HiveODBC

One key feature of Microsoft’s Big Data Solution is solid integration of Apache Hadoop with the Microsoft Business Intelligence (BI) components. A good example of this is the ability for Excel to connect to the Hive data warehouse framework in the Hadoop cluster. This section walks you through using Excel via the Hive ODBC driver.

Install the Hive ODBC Driver

Prerequisites:

Ensure that Excel 2010 64-bit is already installed before installing the HiveODBC Driver.

Download the 64-bit Hive ODBC driver MSI file from the Portal by clicking the Download tile after logging into HadoopOnAzure.com This link is external to TechNet Wiki. It will open in a new window.

This link is external to TechNet Wiki. It will open in a new window.

Note, the HiveODBC driver includes both the HiveODBC Driver and the Excel Hive-AddIn.

Double click HiveODBCSetupx64.msi to start the installation.
Read the license agreement.
If you agree to the license agreement, click I agree and Install.

4. Once the installation has completed, click Finish to exit the Setup Wizard.

Install the Microsoft Excel Hive Add-In

Prerequisites:

Microsoft Excel 2010 64-bit
64bit Hive ODBC driver installed

Start Microsoft Excel 2010 64-bit.
You will be asked to install the HiveExcel add-in. Click Install.
Once the add-in has been installed, click the Data tab in Microsoft Excel 2010. You should see the Hive Panel as shown in the screenshot below.

Create a Hive ODBC Data Source to use with Excel

Click Start->Control Panel to launch the Control Panel for Microsoft Windows.
In the Control Panel, Click System and Security->Administrative Tools->Data Sources (ODBC). This will launch the ODBC Data Source Administrator dialog.

In the ODBC Data Source Administrator dialog, click the System DSN tab.
Click Add to add a new data source.
Click the HIVE driver in the ODBC driver list.

4. Then click Finish. This will launch the ODBC Hive Setup dialog shown in the screenshot below.

Enter a data source a name in the Data Source Name box. For Example, “MyHiveData”.
In the Host box , enter the host name of the cluster you created on the portal. For example, “myhadoopcluster.cloudapp.net”.
Enter the username you used to authenticate on the portal.
Click OK to save the new Hive data source.
Click OK to close the ODBC Data Source Administrator dialog.

Using the Excel Hive Add-In

Go to HadoopOnAzure.com and click Sign In to sign into your Hadoop cluster.
Click the Open Ports tile to access port configurations for your cluster.

3. On the Configure Ports page, click the toggle switch for the ODBC Server port to turn it on.

Open Microsoft Excel 2010 64-bit.
In Microsoft Excel 2010 64-bit, click the Data tab.
Click the Hive Panel to open the Hive panel in Excel.
In the drop-down list labeled Select or Enter Hive Connection, select the data source name you previously created.
You will be asked to enter the password to authenticate with the cluster on the portal. Enter the password for the user name.
In the drop-down list labeled Select the Hive Object to Query, select hivesampletable [Table].
Select the checkbox for each of the columns in the table. The Hive Query panel should look similar to the following.

11. Click Execute Query.

Microsoft Big Data Solution SQL Server, Apache Hadoop and Windows Azure

Big Data Solution

Microsoft’s end-to-end roadmap for Big Data embraces Apache Hadoop™ by distributing enterprise class Hadoop based solutions on both Windows Server and Windows Azure. The roadmap includes Microsoft BI tools such as SQL Server Analysis Services, Reporting Services and even PowerPivot and Excel. This enables you to do BI on all your data, including those in Hadoop.

Key Benefits

Broader access of Hadoop to end users, IT professionals and Developers, through easy installation and configuration and simplified programming with JavaScript.
Enterprise ready Hadoop distribution with greater security, performance, ease of management and options for Hybrid IT usage.
Breakthrough insights through the use of familiar tools such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services.

Technical Considerations
On the more technical front, we have been working on a simplified download, installation and configuration experience of several Hadoop related technologies, including HDFS, Hive, and Pig, which will help broaden the adoption of Hadoop in the enterprise.
The Hadoop based service for Windows Azure will allow any developer or user to submit and run standard Hadoop jobs directly on the Azure cloud with a simple user experience. Therefore it doesn’t matter what platform you are developing your Hadoop jobs on -you will always be able to take a standard Hadoop job and deploy it on our platform, as we strive towards full interoperability with the official Apache Hadoop distribution.
This is great news as it lowers the barrier for building Hadoop based applications while encouraging rapid prototyping scenarios in the Windows Azure cloud for Big Data.To facilitate all of this, we have also entered into a strategic partnership with Hortonworks that enables us to gain unique experience and expertise to help accelerate the delivery of Microsoft’s Hadoop based distributions on both Windows Server and Windows Azure.
For developers, we will enable integration with Microsoft developer tools as well as invest in making Javascript a first class language for Big Data. We will do this by making it possible to write high performance Map/Reduce jobs using Javascript.For end users, the Hadoop-based applications targeting the Windows Server and Windows Azure platforms will easily work with Microsoft’s existing BI tools like PowerPivot and recently announced Power View, enabling self-service analysis on business information that was not previously accessible. To enable this we will be delivering an ODBC Driver and an Add-in for Excel, each of which will interoperate with Apache Hive.
Finally, in line with our commitment to Interoperability and to facilitate the high performance bi-directional movement of enterprise data between Apache Hadoop and Microsoft SQL Server, we have released two Hadoop-based connectors for SQL Server to manufacturing.
The SQL Server connector for Apache Hadoop lets customers move large volumes of data between Hadoop and SQL Server 2008 R2, while the SQL Server PDW connector for Apache Hadoop moves data between Hadoop and SQL Server Parallel Data Warehouse (PDW). These new connectors will enable customers to work effectively with both structured and unstructured data.