Showing posts with label Big Data.

Thursday, 2 August 2012

Connect Excel to Hadoop on Azure via HiveODBC

One key feature of Microsoft’s Big Data Solution is solid integration of Apache Hadoop with the Microsoft Business Intelligence (BI) components. A good example of this is the ability for Excel to connect to the Hive data warehouse framework in the Hadoop cluster. This section walks you through using Excel via the Hive ODBC driver.

Install the Hive ODBC Driver
Prerequisites:


  • Ensure that Excel 2010 64-bit is already installed before installing the Hive ODBC driver.
  • Download the 64-bit Hive ODBC driver MSI file from the portal by clicking the Download tile after logging in to HadoopOnAzure.com.
  • Note that the MSI includes both the Hive ODBC driver and the Excel Hive add-in.




  1. Double-click HiveODBCSetupx64.msi to start the installation.
  2. Read the license agreement.
  3. If you agree to the license agreement, click I agree and Install.
  4. Once the installation has completed, click Finish to exit the Setup Wizard.

Install the Microsoft Excel Hive Add-In

Prerequisites:

  • Microsoft Excel 2010 64-bit
  • 64-bit Hive ODBC driver installed

  1. Start Microsoft Excel 2010 64-bit.
  2. You will be asked to install the HiveExcel add-in. Click Install.
  3. Once the add-in has been installed, click the Data tab in Microsoft Excel 2010. You should see the Hive Panel as shown in the screenshot below.


Create a Hive ODBC Data Source to use with Excel

  1. Click Start->Control Panel to launch the Control Panel for Microsoft Windows.
  2. In the Control Panel, click System and Security->Administrative Tools->Data Sources (ODBC). This launches the ODBC Data Source Administrator dialog.
  3. In the ODBC Data Source Administrator dialog, click the System DSN tab.
  4. Click Add to add a new data source.
  5. Click the HIVE driver in the ODBC driver list.
  6. Click Finish. This launches the ODBC Hive Setup dialog shown in the screenshot below.

  1. Enter a data source name in the Data Source Name box. For example, “MyHiveData”.
  2. In the Host box, enter the host name of the cluster you created on the portal. For example, “myhadoopcluster.cloudapp.net”.
  3. Enter the username you used to authenticate on the portal.
  4. Click OK to save the new Hive data source.
  5. Click OK to close the ODBC Data Source Administrator dialog.
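Once saved, the same system DSN can also be used outside Excel, for example from Python through the pyodbc module. The sketch below only assembles the DSN-based connection string (the DSN name and username are the example values from this walkthrough, and the password is a placeholder); the actual connect call needs a live cluster, so it is left commented out:

```python
def hive_connection_string(dsn, uid, pwd):
    """Assemble a DSN-based ODBC connection string for the Hive driver."""
    return f"DSN={dsn};UID={uid};PWD={pwd}"

# Example values from this walkthrough; the password is a placeholder.
conn_str = hive_connection_string("MyHiveData", "myportaluser", "mypassword")
print(conn_str)  # DSN=MyHiveData;UID=myportaluser;PWD=mypassword

# With the 64-bit Hive ODBC driver installed and the ODBC Server port open,
# a connection could then be opened (requires a live cluster, hence commented out):
# import pyodbc
# conn = pyodbc.connect(conn_str, autocommit=True)
# rows = conn.cursor().execute("SELECT * FROM hivesampletable LIMIT 10").fetchall()
```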


Using the Excel Hive Add-In





  1. Go to HadoopOnAzure.com and click Sign In to sign in to your Hadoop cluster.
  2. Click the Open Ports tile to access port configurations for your cluster.
  3. On the Configure Ports page, click the toggle switch for the ODBC Server port to turn it on.
  4. Open Microsoft Excel 2010 64-bit.
  5. In Microsoft Excel 2010 64-bit, click the Data tab.
  6. Click the Hive Panel button to open the Hive panel in Excel.
  7. In the drop-down list labeled Select or Enter Hive Connection, select the data source name you previously created.
  8. You will be asked to enter the password to authenticate with the cluster on the portal. Enter the password for the user name.
  9. In the drop-down list labeled Select the Hive Object to Query, select hivesampletable [Table].
  10. Select the checkbox for each of the columns in the table. The Hive Query panel should look similar to the following.
  11. Click Execute Query.
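Behind the panel, selecting a table and columns amounts to issuing a HiveQL SELECT against the cluster. A small sketch of how such a query string could be assembled; the column names used here are illustrative, not the actual schema of hivesampletable:

```python
def build_hive_query(table, columns=None, limit=200):
    """Build the kind of HiveQL SELECT a column selection translates into."""
    cols = ", ".join(columns) if columns else "*"
    return f"SELECT {cols} FROM {table} LIMIT {limit}"

# Illustrative column names; check the Hive panel for the real schema.
query = build_hive_query("hivesampletable", ["clientid", "querytime", "market"])
print(query)  # SELECT clientid, querytime, market FROM hivesampletable LIMIT 200
```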


Microsoft BI Tools with Hadoop - Big Data Analytics

Microsoft’s goal for Big Data is to provide insights to all users from structured or unstructured data of any size. While very scalable, accommodating, and powerful, most Hadoop-based Big Data solutions require highly trained staff to deploy and manage. In addition, the benefits are limited to a few highly technical users who are as comfortable programming their requirements as they are using advanced statistical techniques to extract value. For those of us who have been around the BI industry for a few years, this may sound similar to the early 90s, when the benefits of our field were limited to a few within the corporation through Executive Information Systems.

Analysis on Hadoop for Everyone

Microsoft entered the Business Intelligence industry to enable orders of magnitude more users to make better decisions from applications they use every day. This was the motivation behind being the first DBMS vendor to include an OLAP engine, with the release of SQL Server 7.0 OLAP Services, which enabled Excel users to ask business questions at the speed of thought. It remained the motivation behind PowerPivot in SQL Server 2008 R2, a self-service BI offering that allowed end users to build their own solutions without dependence on IT, while giving IT insights into how data was being consumed within the organization. And, with the release of Power View in SQL Server 2012, that goal brings the power of rich interactive exploration directly into the hands of every user within an organization.
Enabling end users to merge data stored in a Hadoop deployment with data from other systems, or with their own personal data, is a natural next step. In fact, we also introduced the Hive ODBC driver, currently in Community Technology Preview, at the PASS Summit in October. This driver provides connectivity to Apache Hive, which facilitates querying and managing large datasets residing in distributed storage by exposing them as a data warehouse.

Microsoft BI connectivity with Hadoop


This connector brings the benefits of the entire Microsoft BI stack and ecosystem to Hive. A few examples include:
- Bring Hive data directly to Excel through the Microsoft Hive Add-in for Excel
- Build a PowerPivot workbook using data in Hive
- Build Power View reports on top of Hive
- Instead of manually refreshing a Hive-based PowerPivot workbook on their desktops, users can use the PowerPivot for SharePoint scheduled data refresh feature to refresh a central copy shared with others, without worrying about the time or resources it takes.
- BI professionals can build a BI Semantic Model or Reporting Services reports on Hive in SQL Server Data Tools.
- Of course, all of the third-party client applications built on the Microsoft BI stack can now access Hive data as well!
Klout is a great example of a customer leveraging the Microsoft BI stack on Big Data to provide mission-critical analysis both for internal users and for its customers.

Best of both worlds




As we mentioned at the beginning of this blog article, one size doesn’t fit all, and it’s important to recognize the inherent strengths of the available options when choosing which to use. Hadoop broadly provides:
- an inexpensive and highly scalable store for data of any shape,
- a robust execution infrastructure for data cleansing, shaping, and analytical operations, typically in batch mode, and
- a growing ecosystem that gives highly skilled users many options for processing data.
The Microsoft BI stack is targeted at a significantly larger user population and provides:
- functionality in tools such as Excel and SharePoint that users are already familiar with,
- interactive queries at the speed of thought,
- a business layer that allows users to understand the data, combine it with other sources, and express business logic in more accessible ways, and
- mechanisms to publish results for others to consume and build on.
Successful projects may use both of these technologies in a complementary manner, as Klout does. Enabling this choice has been the primary motivator for providing Hive ODBC connectivity, as well as for investing in a Hadoop-based distribution for Windows Server and Windows Azure.




Microsoft Big Data Solution: SQL Server, Apache Hadoop and Windows Azure

Big Data Solution
Microsoft’s end-to-end roadmap for Big Data embraces Apache Hadoop™ by delivering enterprise-class Hadoop-based distributions on both Windows Server and Windows Azure. The roadmap includes Microsoft BI tools such as SQL Server Analysis Services, Reporting Services, and even PowerPivot and Excel. This enables you to do BI on all your data, including data in Hadoop.
Key Benefits
  • Broader access to Hadoop for end users, IT professionals, and developers, through easy installation and configuration and simplified programming with JavaScript.
  • Enterprise-ready Hadoop distribution with greater security, performance, and ease of management, plus options for hybrid IT usage.
  • Breakthrough insights through familiar tools such as Excel, PowerPivot, SQL Server Analysis Services, and Reporting Services.

Technical Considerations
On the more technical front, we have been working on a simplified download, installation, and configuration experience for several Hadoop-related technologies, including HDFS, Hive, and Pig, which will help broaden the adoption of Hadoop in the enterprise.
The Hadoop-based service for Windows Azure will allow any developer or user to submit and run standard Hadoop jobs directly on the Azure cloud with a simple user experience. It doesn’t matter what platform you develop your Hadoop jobs on: you will always be able to take a standard Hadoop job and deploy it on our platform, as we strive toward full interoperability with the official Apache Hadoop distribution.
This is great news, as it lowers the barrier to building Hadoop-based applications while encouraging rapid prototyping scenarios in the Windows Azure cloud for Big Data. To facilitate all of this, we have also entered into a strategic partnership with Hortonworks that gives us unique experience and expertise to help accelerate the delivery of Microsoft’s Hadoop-based distributions on both Windows Server and Windows Azure.
For developers, we will enable integration with Microsoft developer tools and invest in making JavaScript a first-class language for Big Data by making it possible to write high-performance Map/Reduce jobs in JavaScript. For end users, Hadoop-based applications targeting the Windows Server and Windows Azure platforms will work easily with Microsoft’s existing BI tools, such as PowerPivot and the recently announced Power View, enabling self-service analysis on business information that was not previously accessible. To enable this, we will deliver an ODBC driver and an add-in for Excel, each of which will interoperate with Apache Hive.
Finally, in line with our commitment to interoperability, and to facilitate high-performance, bi-directional movement of enterprise data between Apache Hadoop and Microsoft SQL Server, we have released two Hadoop-based connectors for SQL Server to manufacturing.
The SQL Server connector for Apache Hadoop lets customers move large volumes of data between Hadoop and SQL Server 2008 R2, while the SQL Server PDW connector for Apache Hadoop moves data between Hadoop and SQL Server Parallel Data Warehouse (PDW). These new connectors will enable customers to work effectively with both structured and unstructured data.

Microsoft SQL Server 2012 Takes on Big Data with Hadoop

At PASS Summit 2011 in Seattle, one of the biggest surprises was Microsoft's announcement of support for Apache Hadoop as part of its SQL Server 2012 plans.
To understand why Hadoop is important and how it relates to SQL Server, we need an idea of what Hadoop actually is and what it isn’t. First, Hadoop isn’t a relational database system, so it’s not a replacement or substitute for SQL Server. Hadoop is an open-source project managed by the Apache Software Foundation, designed to solve a somewhat different problem: handling large amounts of unstructured data. SQL Server and other relational databases primarily store structured data. Unstructured data can be stored using the XML and FileStream data types, but there can be limits on its size, as well as on the amount of processing power that can be applied to access it. The basic technology behind Hadoop was originally developed by Google so that it could index all types of textual information. Google’s ideas were then incorporated into an open-source project named Nutch, and Yahoo! later worked to transform Hadoop into an enterprise application. Hadoop is used by several notable companies; perhaps the most recognizable is Facebook, which in 2010 had the largest Hadoop cluster in the world, with more than 20PB of storage.
Hadoop is written in Java and runs on a collection of commodity shared-nothing servers. You can add or remove servers from a Hadoop cluster at any time without disrupting the service. The more servers you use, the more computing power you get. A Hadoop implementation consists of two key components: the Hadoop Distributed File System (HDFS), which provides data storage across multiple servers, and high-performance parallel data processing, which uses a technique called MapReduce. MapReduce essentially splits up data discovery and indexing tasks by sending different parts to all of the servers in your cluster. Each server works on its own piece of the data, and the results are then delivered back to the user as a complete set. In essence, MapReduce maps the operation out to all of the servers in the cluster and reduces the results into a single result set.
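The map-then-reduce flow described above can be illustrated with a toy single-process word count in Python; this is a conceptual sketch of the technique, not the Java MapReduce API:

```python
from collections import defaultdict

def map_phase(chunk):
    # Each "server" emits (word, 1) pairs for its own piece of the data.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The emitted pairs are merged into a single result set.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data big", "data hadoop", "big hadoop hadoop"]  # one chunk per server
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'big': 3, 'data': 2, 'hadoop': 3}
```

In a real cluster, the map calls run on different machines and the framework shuffles the pairs to the reducers; the logic, however, is the same.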
To implement Hadoop, you can buy a collection of commodity servers and run the Hadoop software on each one to create a high-performance Hadoop cluster. For better scalability, you can add more servers. When you load your data into Hadoop, the software breaks it into pieces and distributes it across all the available servers. There's no central location from which you access your data; the Hadoop cluster keeps track of where the data resides and automatically stores multiple copies of it. If a server fails or is removed from the cluster, Hadoop automatically replicates the data from a known copy.
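The replication behavior described above can likewise be sketched as a toy model; the node names and simple round-robin placement are illustrative assumptions, not HDFS's actual placement policy:

```python
def place_blocks(blocks, servers, replication=3):
    """Store each block on `replication` distinct servers (toy round-robin)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [servers[(i + r) % len(servers)] for r in range(replication)]
    return placement

servers = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_0", "blk_1", "blk_2"], servers)

# If node1 fails, every block still survives on its remaining replicas,
# and the cluster can re-replicate from any known copy.
surviving = {b: [s for s in nodes if s != "node1"] for b, nodes in placement.items()}
assert all(surviving.values())  # no block lost
```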
Because Hadoop is an open-source product, you might wonder what it has to do with Windows. At PASS Summit 2011, Microsoft announced that the company had created a Windows version of Hadoop that can run on Windows Server for on-premises implementations or on Windows Azure for cloud implementations. In addition, Microsoft is working with Hortonworks to develop bi-directional connectors for Hadoop and SQL Server. The SQL Server connector for Apache Hadoop lets customers move large volumes of data between Hadoop and SQL Server 2008 R2 or SQL Server 2012. There will also be a SQL Server Parallel Data Warehouse (PDW) connector for Hadoop that transfers data between Hadoop and SQL Server PDW. These new connectors will enable customers to work with both structured SQL Server data and unstructured data from Hadoop.
Hadoop isn’t a replacement for SQL Server’s relational database. Instead, it provides new capabilities that weren’t previously available. I think Microsoft’s view is that Hadoop will be used in conjunction with SQL Server’s relational and analytic capabilities, enabling enterprises to deploy Hadoop implementations alongside their existing IT systems. This will extend the types of data that you can use in your applications, similarly to how SQL Server Analysis Services (SSAS) extends the SQL Server relational database engine. In addition, it'll help SQL Server better compete with both Oracle and IBM’s DB2, which have also embraced Hadoop. Big data is a rapidly growing trend, and the ability to incorporate big data with SQL Server is a big deal.