Let us go and see what the first step is that you need to take. The first step is to assess yourself: how challenging or easy this journey will be depends on which technology you are currently working in. If you have a good hold on programming languages like Java, Scala, Python, R, or SQL, it is going to be pretty easy for you. On top of that, if you have an idea of data warehousing, client-server architecture, and Linux shell scripting, it will be very easy for you to get into big data.
On the other hand, if your current experience is in some different technology in the IT industry, for example SAP, web development, web servers and applications, or JavaScript, it is still not completely impossible. However, the learning curve will be a bit steeper. So now, let us see the first
thing you need to do. You need to understand the use of big data: why companies are using it, and what challenges the industry used to face before adopting it. You should have this basic idea of why you are going to learn big data. Is it really that big, or is it just a marketing gimmick?
I am talking about just a basic idea of what batch data is, what streaming or live data is, what structured, semi-structured, and unstructured data are, what ETL concepts are, what Hadoop is, and what Spark, Hive, NoSQL databases, HBase, and data warehousing are. I have mentioned these as one-liner points because you just need very basic information about what each one is, so that when you go ahead and dive deep into big data, these terms do not sound new to you. So, just the basic idea of what these things are.
So let us say you start learning today. You can finish this in around one month. I have given one month because I believe many of you will be working at some IT company or the other, so I am not sure how much time you will be able to give it. However, as far as these topics are concerned, one month should be more than enough to get the basic idea.
Then, once you have the basic idea, you need to see what Hadoop is. Once you get into the big data world, Hadoop is one term that will keep popping into your mind all the time. Therefore, you should have at least a basic idea of the points I have mentioned in these bullets.
Hadoop architecture: what the NameNode and DataNodes are, the file system (HDFS), the resource manager (YARN), cluster resource planning (that is, how you can plan a Hadoop cluster), important configuration files, important processes and daemons, high availability, and fault tolerance. So let us give around 10 to
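To make the NameNode and DataNode roles concrete, here is a minimal Scala sketch that talks to HDFS through Hadoop's FileSystem API. This is just an illustration, assuming the Hadoop client libraries are on your classpath; the NameNode address and paths are placeholders (on a real cluster, fs.defaultFS comes from core-site.xml).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBasics {
  def main(args: Array[String]): Unit = {
    // Placeholder NameNode address; normally read from core-site.xml.
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://localhost:8020")
    val fs = FileSystem.get(conf)

    // The NameNode only keeps metadata; file blocks live on the DataNodes.
    val dir = new Path("/user/demo")
    if (!fs.exists(dir)) fs.mkdirs(dir)

    // Writing streams blocks to DataNodes; HDFS replicates each block,
    // which is where fault tolerance comes from.
    val out = fs.create(new Path(dir, "hello.txt"))
    out.write("hello hdfs".getBytes("UTF-8"))
    out.close()

    // List the directory to confirm the write went through.
    fs.listStatus(dir).foreach(status => println(status.getPath))
    fs.close()
  }
}
```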
12 days to understanding these basic points. Then, once you are done with this, the next topic is ETL. You need to understand what ETL is and what its use is in big data. You need to look at some important ETL concepts and important ETL tools. Let us give this around three to five days; a minimal ETL sketch follows below.
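To show what the E, T, and L actually look like in practice, here is a minimal sketch in Scala using Spark (which we will pick up properly in the next step). The file names, column names, and conversion rate are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SimpleEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("simple-etl")
      .master("local[*]") // local mode for practice; a real cluster gets this from spark-submit
      .getOrCreate()

    // Extract: read raw CSV data (hypothetical file and columns).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("input/sales.csv")

    // Transform: drop bad rows and derive a new column.
    val cleaned = raw
      .filter(col("amount").isNotNull && col("amount") > 0)
      .withColumn("amount_usd", col("amount") * 0.012) // hypothetical rate

    // Load: write a columnar copy for downstream queries.
    cleaned.write.mode("overwrite").parquet("output/sales_parquet")

    spark.stop()
  }
}
```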
See, friends, in this article I am just telling you the bullet points that you should keep at your fingertips if you are going into the big data world. I am not going deep into what the ETL concepts are or which ETL tools to use; I am just giving you a road map that will lead you to a good position in the big data industry. The next step is that you have to understand
Spark. Spark is one of the most used technologies in the big data world. So you need to understand what Spark is, why Spark is required in big data, and how the two are related; you need to clear this up. Then come Spark's data structures and their importance: Spark has three data structures (RDD, DataFrame, and Dataset), so you need a basic idea of all of them. Then there are the important Spark libraries, Spark components like the driver, workers, and executors, important Spark functions, and the Spark program execution lifecycle. Spark is a bit complicated, so even getting the basic idea will take seven to ten days. A small sketch of the three data structures follows below.
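If those three data structures sound abstract, here is a small self-contained Scala sketch showing them side by side. It runs in Spark's local mode, so no cluster is needed; the data is made up.

```scala
import org.apache.spark.sql.SparkSession

object SparkStructures {
  // The case class gives the Dataset its compile-time schema.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-structures")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 1. RDD: the low-level distributed collection, no schema attached.
    val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 29)))
    println(rdd.map(_._2).sum()) // 63.0

    // 2. DataFrame: rows with a named schema, optimized by Catalyst.
    val df = rdd.toDF("name", "age")
    df.filter($"age" > 30).show()

    // 3. Dataset: the typed API; wrong field names fail at compile time.
    val ds = df.as[Person]
    ds.map(p => p.name.toUpperCase).show()

    spark.stop()
  }
}
```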
Then your next step is to understand Hive: what Hive is, why Hive is required in the big data world, and how it relates to SQL. Then basic Hive commands and performance-tuning concepts, the most important of which are partitioning and bucketing, and then input and output file formats. This will suffice for a basic idea of Hive, so let us give it three to five days. A small partitioned-table sketch follows below.
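To stay in one language, here is a sketch that drives Hive through Spark's SQL interface; the same CREATE, INSERT, and SELECT statements can be run as-is in the hive shell on the Cloudera VM. The table and columns are made up, and enableHiveSupport assumes a configured Hive metastore.

```scala
import org.apache.spark.sql.SparkSession

object HiveBasics {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport assumes a configured Hive metastore,
    // which the Cloudera QuickStart VM provides out of the box.
    val spark = SparkSession.builder()
      .appName("hive-basics")
      .enableHiveSupport()
      .getOrCreate()

    // Partitioning: each country gets its own directory on disk,
    // so queries filtering on country scan far less data.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
      PARTITIONED BY (country STRING)
      STORED AS PARQUET
    """)

    spark.sql("""
      INSERT INTO TABLE sales PARTITION (country = 'IN')
      VALUES (1, 250.0), (2, 99.5)
    """)

    // Partition pruning: only the country=IN directory is read.
    spark.sql("SELECT count(*), sum(amount) FROM sales WHERE country = 'IN'").show()

    spark.stop()
  }
}
```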
After understanding these concepts (ETL, Hadoop, Spark, and Hive), you can say that you have achieved a beginner level. Then what next? Whatever concepts you have learned, obviously, in the programming world you have to do a lot of hands-on. For this, I will tell you a very basic setup that is very easy to get running on your laptop or desktop, gives you basically all the tools we have discussed so far, and lets you get your hands dirty.
First, you have to download Oracle VirtualBox. Then, once you have downloaded Oracle VirtualBox, the second step is to download the Cloudera QuickStart VM. The VirtualBox download is somewhere around 150 to 200 MB, but the Cloudera QuickStart VM is somewhere around 5 to 6 GB. This Cloudera QuickStart VM includes big data tools like Hive, HBase, Spark, Scala, Eclipse, YARN, and ZooKeeper, along with a complete Hadoop setup.
The VM is based on a Linux (CentOS) operating system. All of this setup will not need more than one day, and once you have it running, you can start doing your hands-on work as a beginner. First, you should practice use-case scenarios; you will find most of them on Google. Initially, you should practice things like file input and output in HDFS and how to load data into Hive, and brush up your skills on Scala programming and basic data operations in Spark. A classic first exercise is sketched below.
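Here is that classic beginner exercise, a word count over a text file in HDFS, sketched in Scala. The HDFS path is a placeholder; put any text file there first (for example with hadoop fs -put).

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .getOrCreate()
    val sc = spark.sparkContext

    // Placeholder path; load a file into HDFS first, e.g.:
    //   hadoop fs -put notes.txt /user/cloudera/notes.txt
    val lines = sc.textFile("hdfs:///user/cloudera/notes.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))          // split each line into words
      .filter(_.nonEmpty)
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)                // sum the counts per word

    counts.take(20).foreach(println)
    spark.stop()
  }
}
```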
The list of such hands-on use-case scenarios can go on and on. I hope you liked this article; if you did, please hit that like button and please subscribe to our website. Thank you.