TOP 5 MISTAKES IN HADOOP AND HOW TO AVOID THEM

Hadoop has its strengths and its difficulties. Businesses need specialized skills and data integration, and must factor both into planning and implementation. Even when they do, a large percentage of Hadoop implementations fail.

To help others avoid common pitfalls, this article covers the top 5 mistakes with Hadoop and how to avoid them.

MISTAKE 1: MIGRATE EVERYTHING BEFORE DEVISING A PLAN

As tempting as it can be to dive head first into Hadoop, never start without a plan. Migrating everything without a clear strategy will only create long-term issues and expensive ongoing maintenance. With first-time Hadoop implementations, users can expect a lot of error messages and a steep learning curve.

Successful implementation starts by identifying a business use case. Consider every phase of the process, from data ingestion to data transformation to analytics consumption, and even beyond to other applications and systems where analytics must be embedded. Also determine clearly how Hadoop and big data will create value for the business.

Our advice: Maximize your learning in the least amount of time by taking a holistic approach and starting with smaller test cases.

MISTAKE 2: ASSUME RELATIONAL DATABASE SKILLSETS ARE TRANSFERABLE TO HADOOP

Hadoop is built on a distributed file system (HDFS); it is not a traditional relational database (RDBMS). You can’t simply migrate all your relational data and manage it in Hadoop the same way, nor can you expect skillsets to be easily transferable between the two.

If the team is lacking Hadoop skills, it doesn’t necessarily mean you have to hire all new people. Every situation is different, and there are several options to consider. It might work best to train existing developers. You might be able to plug skills gaps with point solutions in some instances, but growing organizations tend to do better in the long run with an end-to-end data platform that serves a broad spectrum of users.

Our advice: To be successful, look for software that offers the right functionality and agility, along with the right combination of people. Many tools are available that automate some of the repetitive aspects of data ingestion and preparation.

MISTAKE 3: FIGURE OUT SECURITY LATER

High-profile data breaches have motivated most enterprise IT teams to prioritize protecting sensitive data. If you are considering big data, keep in mind that you will be processing sensitive data about your customers and partners. Never, ever expose card and bank details or personally identifiable information about clients, customers, or employees. Protection starts with planning ahead.
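
One way to plan ahead is to mask or pseudonymize sensitive fields before records ever land in the cluster. The following is a minimal sketch in Python, not a complete security control. The field names (`card_number`, `email`) and the helper functions are hypothetical, and the secret key is assumed to be managed outside the code.

```python
import hashlib
import hmac


def mask_card_number(card_number: str) -> str:
    """Replace all but the last four digits with asterisks."""
    digits = [ch for ch in card_number if ch.isdigit()]
    if len(digits) <= 4:
        # Too short to reveal anything safely: mask everything.
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest so the raw value is never stored."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sanitize_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with sensitive fields protected.

    The field names below are assumptions made for illustration only.
    """
    clean = dict(record)
    if "card_number" in clean:
        clean["card_number"] = mask_card_number(clean["card_number"])
    if "email" in clean:
        clean["email"] = pseudonymize(clean["email"], secret_key)
    return clean
```

Because the digest is keyed and deterministic, the same email always maps to the same token, so joins and counts still work on the protected data while the original value stays out of the cluster.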

Our advice: Address security before deploying a big data project. Once a business need for big data has been established, decide who will benefit from the investment and how it will impact the infrastructure.

MISTAKE 4: BRIDGING THE SKILLS GAP WITH TRADITIONAL ETL

Plugging the skills gap can be tricky for organizations looking to solve big data’s ETL challenges. Many developers are proficient in Java, Python, and HiveQL, but may lack the experience to optimize performance on relational databases. This problem is magnified when Hadoop and MapReduce are used for large-scale traditional data management workloads such as ETL.

Some point solutions can help to plug the skills gap, but these tend to work best for experienced developers. If you’re dealing with smaller data sets, it might work to hire people who’ve had the proper training on big data and traditional implementations, or work with experts to train and guide staff through projects. But if you’re dealing with hundreds of terabytes of data, then you will need an enterprise-class ETL tool as part of a comprehensive business analytics platform.

Our advice: People, experience, and best practices are essential for successful Hadoop projects. When considering an expert or a team of experts as permanent hires or consultants, look at their experience with “traditional” as well as big data integration, the size and complexity of the projects they’ve worked on, the organizations they have worked with, and the number of successful implementations they have completed. When dealing with large volumes of data, it might be time to evaluate a comprehensive business analytics platform designed to operationalize and simplify Hadoop implementations.

MISTAKE 5: ENTERPRISE-LEVEL VALUE ON A SMALL BUDGET

The low-cost scalability of Hadoop is one of the reasons organizations decide to use it. But many organizations fail to factor in data replication and compression, skilled resources, and the overall management of integrating big data into the existing ecosystem.

Hadoop is built to process enormous data files that continue to grow. It’s essential to do proper sizing up front. This includes having the skills on hand to leverage SQL and BI against data in Hadoop and to compress data at the most granular levels. Compression also needs to be balanced against performance expectations for reading and writing data. Also, storing the data may cost 3x more than initially planned; one contributor is that HDFS keeps three replicas of every block by default.
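
The sizing arithmetic can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a capacity-planning tool. The replication factor of 3 matches the HDFS default, while the compression ratio, the 25% allowance for temporary and intermediate data, and the growth rate are placeholders to replace with your own measurements. The function names are made up for this example.

```python
def projected_tb(initial_tb: float, monthly_growth_rate: float, months: int) -> float:
    """Project logical data volume with compound monthly growth."""
    return initial_tb * (1 + monthly_growth_rate) ** months


def raw_storage_tb(
    logical_tb: float,
    replication: int = 3,            # HDFS default replication factor
    compression_ratio: float = 1.0,  # assumption: 2.0 means data shrinks by half
    temp_overhead: float = 0.25,     # assumption: headroom for intermediate data
) -> float:
    """Estimate the raw disk capacity needed for a given logical data size."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    compressed = logical_tb / compression_ratio
    replicated = compressed * replication
    return replicated * (1 + temp_overhead)


if __name__ == "__main__":
    # Hypothetical scenario: 100 TB today, 5% monthly growth, 12-month horizon.
    future = projected_tb(100, 0.05, 12)
    print(f"Logical data in 12 months: {future:.1f} TB")
    print(f"Raw capacity needed: {raw_storage_tb(future, compression_ratio=2.0):.1f} TB")
```

With no compression and no headroom, 100 TB of logical data already needs 300 TB of raw disk, which is where the 3x figure comes from. Compression lowers that number, but at the cost of extra CPU work on reads and writes.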

Our advice: Understand how the storage, resources, growth rates, and management of big data will factor into your existing ecosystem before you implement.