One Click Answer

How to make your career in Big Data Tools?

April 28, 2020
By Sanjeev Dubey
0 Comments

[Get a job in big data,big data for beginner show to get a job in it industry jobs in Bangalore,big data jobs,big data scope,how to study for big data,how to study for spark,high demand IT jobs hadoop jobs,how to get a job in Hadoop,Best IT jobs getting started with Hadoop getting started with Big data,Jobs for beginners Coding jobs,big data interview questions,hadoop interview questions jobs after btech, high paying IT jobs, Big data salary

Let us go and see what will be the first step that you need to take. the first step would be to analyze yourself that how challenging or easy, it will be for you depending upon which technology you are in. So if you have a good hold on programming languages like Java, Scala, Python, R, or SQL. It's going to be pretty easy for you and on to that if you have an idea on data warehousing, client-server architecture, Linux shell scripting, it will be very easy for you to get into big data.

On the other hand, if your current experience is on some different technologies in the IT industry, for example like sap, web development, web servers applications, or java scripts then also it is not completely impossible. However, yeah learning curve will be a bit on a higher side. So now, let us see what could be the first step that you need to take the first thing. you need to understand the use of big data, I mean why these companies are using big data, what challenges the industry used to face when they were not using big data, I mean this basic idea you should have that why you are going to learn big data. Is it that big or it is just a market gimmick?

I am talking just the basic idea on what is batch data, what is streaming or live data, structured, semi-structured or unstructured data, what are eating concepts, what is Hadoop, what is spark hive, NoSQL databases, HBase and data warehousing. I have mentioned it in one-liner points because you just need to have very basic information like what it is, so that when you go ahead and deep dive into Big Data, these terms should not sound new to you. So just the basic idea of what these things are.

So let us say you start learning today. You can finish it in somewhat around one month. I have given one month because many of you I believe will be working with some of the other IT companies. so I'm not sure how much time you will be able to give it. However, yes, as far as these topics are concerned one month should be more than enough to get the basic idea. Then once you have the basic idea, you need to see what is Hadoop, once you get into the Big Data world, Hadoop should be one term that should be popping in your mind all the time. Therefore, you should have at least all this basic idea, which I have mentioned in these bullet points.

Hadoop architecture what is name node and data nodes, file system resource, managers cluster, resource planning, that means how you can plan a Hadoop cluster, important configuration files important processes and demons, high availability, and fault tolerance. So let us give near about 10 to 12 days for understanding these basic points. Then once you are done with this, the next topic will be ETL. You need to understand what ETL is and what its use is in big data. You need to see some important ETL concepts and important ETL tools. Let us give this near about three to five days.

See friends in this article I am just telling you the bullet points, that you should keep on your fingertips if you are going into the big data world. I am NOT going into the deep what ideal concepts will be, what ETL tools. I am just giving you a road map that will lead you to a good position in the big data industry. The next step would be you have to understand spark. Spark is one of the most used technology in the big data world. So you need to understand what is a spark, why spark required in big data, how are they related, you need to clear this up. Spark data structures and their importance. Spark has three data structures, so you need to have a basic idea of all of them. Then important spark libraries spark components like driver worker executor etc. important spark functions and spark program execution lifecycle. Therefore, for spark, it has a bit complicated so getting the basic idea itself will give it seven to ten days.

Then your next step would be to understand hive, what is a hive, why is hive required in the big data world, SQL. Then basic hive commands, performance, tuning concepts, which are most important tuning concepts like partitioning bucketing, then input-output file formats. Therefore, this will suffice for a basic idea in the hive. So let us give it three to five days so after understanding these concepts.

ETL Hadoop spark and hive you can say that you have achieved a beginner level than what next. Whatever concepts you have learned obviously in the programming world, you have to do a lot of hands-on. For which, I will tell you a very basic setup which is very easy to set up on your laptop or desktop and that will provide you basically all the tools which we have discussed so far and you can get your hands dirty on it. Therefore, for this, you have to download Oracle VirtualBox. Then once you have downloaded Oracle VirtualBox, the second step would be to download Cloudera QuickStart VM, this VirtualBox image would be somewhere around 150 to 200 MB but this Cloudera QuickStart VM will be somewhere around 5 to 6 GB. This Cloudera QuickStart VM includes big data tools like hive HBase Spark Scala Eclipse yarn zookeeper and a complete Hadoop setup.

This is based on Linux and OS operating system so for all this setup, you will not need more than one day and once you have all this set up as a beginner you can start doing your hands-on. First thing, you should practice the use case scenarios you will find most of the use case scenarios on Google, initially, you should practice like file input-output and HDFS how to load data in the hive, brush up some of your skills on Scala programming and basic data operations in SPARK. Therefore, friends for these hands-on use case scenarios will go on and on. I hope you like this article if you really did please hit that like button and please subscribe to our website. Thank you.

Big Data

Big Data Tools and Technologies

April 28, 2020
By Sanjeev Dubey
0 Comments

What is big data? Big data is extremely large or complex set of data and it is so large that it is difficult to process it using traditional database and software techniques. Every day we are creating approximately 2.5 quintillion bytes of data. So where is this huge amount of data being generated? from earlier we had mobile phones with the functionality of calling and text messages or clicking some pictures maybe but with the new technologies like smartphones we have a lot of applications for music, sports, social media like Facebook, Twitter, LinkedIn, and many more also.

Data is being generated when we shop online. So why does it need attention as the data is growing companies are capturing the data that streams into their businesses. They can apply analytics and get significant value from it with better speed and efficiency. Companies are leveraging the benefits of big data by analyzing the patterns and trends and predicting something useful out of it, for example, companies like Amazon and Netflix use big data to improve the customer experience as we see here from the statistics shown by 2020.

1.7 megabytes of data will be created every second for each human. This needs immediate attention because this data cannot be just thrown away it is going to give profit to the businesses big data challenges. Big data is not just about the volume of data it poses. Other challenges as well like velocity and variety as a volume 40 zettabytes of data will be created by 2020. this huge volume of data is either human-generated like from social media YouTube or candy machine-generated like through sensors and personal health trackers and canal so be generated with organizations like card details commercial transactions and medical records another challenge is the velocity the speed at which data is coming into the system.

The data needs to be processed with faster speed and then there is a variety of data is not only structured but also unstructured and semi-structured data like images videos and tweets. so how our enterprise is using this big data today. Let us see Big Data popular use cases Internet of Things. These are numerous ways in which analytics can be applied to the Internet of Things for example sensors are used to collect data. That can be analyzed to achieve actionable insights tracking customer or product movement etc. Many enterprises are creating a dashboard application that provides a 360-degree view of the customers that pulls data from a variety of sources analyzes it and presents it to customer service.

Therefore, this allows them to gather rich insights about businesses' big data. Popular use cases are related information security and data where have optimizations big data tools are being used to remove some of the burdens from the data warehouses. Even the healthcare industry is looking for patterns and treatments that lead to the best outcomes for patients. The main challenge of Big Data is storing and processing the data at a specified time span. The traditional approach is not efficient in doing that. Therefore, Hadoop technology and various big data tools have emerged to solve the challenges faced in the Big Data environment. Therefore, there are many big data tools and all of them help the user in some or another way in saving time money and uncovering business insights. These can be divided into the following categories like data storage and management.

The next broad category is data cleaning. Data needs to be cleaned up and well structured. Examples of such tools, which help in defining and reshaping the data into useable data sets, are Microsoft Excel. Data mining is a process of discovery in sites within a database some of the popular tools used for data mining are Terra data and rapid miner. Data visualization tools are a useful way of conveying and complex data insights in a pictorial way that is easy to understand. For example, Tableau and IBM Watson analytics and Plotly are the common tools for data reporting. Data ingestion is the process of getting the data into Hadoop, which can be done using scooped flume or storm data analysis requires asking questions and finding the answers and data.

The popular tools used for data analysis are Hive, Pig, MapReduce, and Spark. Data acquisition is also used for acquiring the data for which scoop flume or storm tools are quite popular. The popular Big Data tools offer many advantages, which can be summarized; as follows, they provide the analyst with advanced analytics algorithms and models. They help the user to run on big data platforms such as Hadoop or any high-performance analytic systems. They help the user to work with not only structured data but also unstructured and semi-structured data coming from multiple sources. It is quite easy to visualize and analyze data in a form that helps in conveying the complex data insights in a pictorial way, which is easy to understand by users big data tools help you to integrate with other technologies very easily.

Big Data

What is Big Data?

April 28, 2020
By Sanjeev Dubey
0 Comments

In this session, let us try to understand what big data is. Big data refers to the huge volume of data that cannot be stored and processed using the traditional approach within the given period. The next big question that comes to our mind is how huge this data needs to be in order to be classified as big data. There is a lot of misconception while referring to the term Big Data. We usually use the term Big Data to refer to the data that is either in Gigabytes or Terabytes or Petabytes or Exabyte or anything that is larger than this in size. This does not define the term Big Data completely even a small amount of data can be referred to as big data depending on the context it is being used.

Let me take an example and try to explain it to you for instance. If we try to attach a document that is of 100 megabytes in size to an email, we would not be able to do so as the email system would not support an attachment of this size. Therefore, these 100 megabytes of attachment with respect to email can be referred to as Big Data.

Let me take another example and try to explain the term Big Data. Let us say we have around 10 terabytes of image files upon which certain processing needs to be done. For instance, we may want to resize and enhance these images within a given period. Suppose if we make use of the traditional system to perform this task we would not be able to accomplish this task within the given period, as the computing resources of the traditional system would not be efficient to accomplish this task on time. Therefore, these 10 terabytes of image files can be referred to as big data.

Now let us try to understand big data using some real-world examples. I believe you all might be aware of some of the popular social networking sites such as Facebook, Twitter, LinkedIn, Google+, and YouTube. Each of these sites receives a huge volume of data on a daily basis. It has been reported on some of the popular tech blocks that Facebook alone receives around 100 terabytes of data each day. Whereas Twitter processes around 400 million tweets each day as far as LinkedIn and Google+ are concerned each of their sites receives tens of terabytes of data on a daily basis.

Finally coming to YouTube, it has been reported that each minute around 48 hours off lash videos are uploaded to YouTube you can just imagine how much volume of data is being stored and processed on these sites. However, as the number of users keeps growing on these sites storing and processing this data becomes a challenging task. Since this data holds a lot of valuable information. This data needs to be processed in a short span of time by using this valuable information. Companies can boost their sales and generate more revenue by making use of the traditional computing system. We would not be able to accomplish this task within the given period, as the computing resources of the traditional computing system would not be sufficient for processing and storing such a huge volume of data. This is where Hadoop comes into the picture we would be discussing Hadoop in more detail in the later sessions; therefore we can term this huge volume of data as big data.

Let me take another real-world example related to the airline industry and try to explain the term big data. For instance, the aircraft is while they are flying they keep transmitting data to the air traffic control located at the airports. The air traffic control uses this data to track and monitor the status and progress of the flight on a real-time basis. Since multiple aircraft would be transmitting this data simultaneously, a huge volume of data is accumulated at the air traffic control within a short span of time.

Therefore, it becomes a challenging task to manage and process this huge volume of data using the traditional approach. Hence, we can turn this huge volume of data into big data. I hope you all might have understood what big data is.

R Programming

R vs Python. What to learn in 2020?

April 26, 2020
By Sanjeev Dubey
0 Comments

Python and R are the two most commonly used languages in data science and nowadays, most of the fresher's get confused, whether they should use R or Python to kick-start their career in the field of data science domain.

I am gonna tell you the long and the short of both of these topics. So, without wasting more time, let's get started. I am gonna start off with their basic definitions:

Starting with R- R is a programming language made by statisticians and data miners for statistical analysis and graphics supported by the R foundation for statistical computing. R also provides high-quality graphics and it has some popular libraries, which help in analytical parts such as R Markdown and Shiny.

Python, on the other hand, is a fully-fledged, Object-oriented & high-level programming language made by programmers and developers' for general-purpose programming. Python is widely used in GUI based applications such as games, graphic designs, Web applications and many more.

So, we can say that R's functionality is developed by statisticians mind, thereby giving it a field-specific advantages while Python is often praised for being a general-purpose language with an easy-to-understand syntax.

Let us start from the first factor, that is speed. When it comes to speed, python is faster than R only till 1000 iterations but, after the 1000 iterations, R starts using the lapply function which increases its speed, in that case, R becomes faster than python. So, both have their own advantages. Right?

Moving forward to the next point: that is, Code and Syntax. In this topic, I am gonna give you a brief about the variable declaration, Data handling capacity with the scatterplot visualization and Plot graphics. Starting off with Variable Declaration. Let's take the case of String here. As R uses the similar implementation to that of the S programming language, which uses arrow signs in order to initialize the variable, which was also present in case of S programming language. These arrows can be used from right to left or left to right indicating whom to assign the variables whereas python uses an assignment operator to initialize the variables. Basically, R developers thought that it would be better to tell the direction of assignment rather than just using an assignment operator, which could actually confuse any new programmer about which variable is being assigned.

Next is the Data Handling capability, here, I am gonna show you the case of ScatterPlots, by which you will see the visualizations in R and python. These are the piece of codes in R and Python and after running these codes, you will get the very similar plot results in both the cases, if you check the code here, then this shows that how R data science ecosystem has many smaller packages like GGally, which basically is a package that helps ggplot2 and also, it is the most-used R plotting package, whereas in Python, matplotlib is the primary plotting package, and seaborn is a widely used layerover the matplotlib.

So, guys, these are the plot results that I was talking about, you can see that the graph results for both R and Python are similar, but the only difference is their visualization. So guys, based on these points and plot results, we can conclude that R has Many packages supporting different methods of doing things. Whereas there is usually one way to do something in python.

Moving on to the next point that is Graphics. Here we will take the case of ClusPlots. So Guys, as we already discussed that R was basically built for statistical analysis, so it has many specific libraries for plotting. This is the reason R comes up with beautiful charts and graphs whereas Python's main agenda was not a statistical analysis, so in the early stages of Python, packages for data analysis was an issue, but it has improved a lot.

Here is the plot result: As you know that a picture says more than a thousand words. Here You can see by yourself that R comes up with beautiful graphical representations. So here we can say that R is handy when it comes to Data Handling. Our next point of attention is Deep Learning, which is today's trend. As you all know, almost the majority of the companies are working on Artificial Intelligence, And Deep Learning is the main part of Artificial intelligence. So, When it comes to Deep Learning, Python is more versatile than R as it provides more features to deep learning whereas R is new to Deep Learning. R has newly added APIs like Keras and KerasR, which are written in Python. Right?

So now somewhere in your mind, this question might be floating why Keras? Actually, Keras in Python has the capabilities to run over python's strong APIs like tensorflow or Theano or Microsoft's CNTK. So we can say that Python has a greater advantage here. Till now, we have seen that both are useful in their own terms.

Now if we look at the Ease of Learning Point: Python is easy to start with as its languages are based on standardized format, i.e. people find it easy to read. It looks like you are reading English. R, on the other hand, is an unstandardized language. It is quite hard to learn as compared to Python. Beginners may find this hurdle in the starting. In the past years of research, the percentage of people switching from R to Python are more as compared to Python to R.

Let's say, if 10% people are switching from Python to R then, 20% are switching from R to Python, which is twice as compared to the before scenario. Next, we are gonna look at the trends, community support, and Jobs: Before 2016, R was more in use. But here we can see that from 2016, Python is in trend. So, it's more popular than R. And because of its popularity, it has overall good support for general purpose programming. Well if we talk about the community support, Then Python and R support aspects are almost similar as Python's support is found at: Mailing list, user-contributed code & documentation & StackOverflow. Basically, it has more adoption from developers & programmers end.

Whereas R language support is also found at: Mailing list, user-contributed documentation & active StackOverflow members. Basically, R has more adoption from researchers, data scientist and statisticians end. Now if we talk about Job trends, let's check the Google Job Trends graph right here, this is the Job postings for R and Python in past 12 months "WORLDWIDE" where python is asked more as compared to R. How is it possible? Because of its popularity and its need in the current industry. Since Python is more versatile and an all-rounder programming language which can be used for majority of the purposes such as web and application development, game development, artificial intelligence, data science, statistical analysis etc, whereas R language is used among statisticians and data miners for developing statistical software and data analysis. Which clearly depicts that, there are more jobs for python than R.

Now let's move forward! So, Which one to choose for Data Science R or Python? Guys, this the frequently asked question by the majority of the learners in this domain. I would suggest using both if you have the choice. They complete each other gracefully and will make your life better if you leverage their strengths and avoid their weaknesses. Everything has their own pros as well as cons, so as in the case of R and Python.

If we talk about pros in R, well, then R is great for prototyping and for statistical analysis. It has a huge set of libraries which are available for different statistical type analysis. Even RStudio IDE is definitely a big plus as it eases most of the tedious tasks and fastens your workflow. Talking about its cons, well The syntax could be obscure sometimes. And it is harder for it to integrate to production workflow. In my opinion, it is better suited for "consultancy-type" tasks. The libraries documentation isn't always user-friendly.

Talking about the pros in Python, Python is great for scripting and automating your different data mining pipelines. It is the de facto scripting language nowadays. And it also integrates easily in a production workflow. Besides, it can be used across different parts of your software engineering team (like for back-end, cloud architecture etc. The scikit-learn library in python is awesome for machine-learning tasks. Python (and its notebook) is also a powerful tool for exploratory analysis and presentations.

Talking of its cons Then python isn't as thorough for statistical analysis as R, but it has come a long way these recent years. In my opinion, the learning curve is steeper than R, since you can do much more with Python.

To conclude it, I'd like to that you can use R and Python both. Learn how they inter-operate together. Start with one and then add the other to your workflow. It only adds another skill-set into your resume, which comes as an added bonus to your career, Isn't it? So, guys, now it's a wrap time. Thank you so much for reading this article session. I'd love to hear from you guys that which one according to you is better and why?

Please reply to us in the comment section below.

R Programming

R Programming for Beginners. Why you should use R?

April 26, 2020
By Sanjeev Dubey
0 Comments

R Programming, R Language, Why R, Python & R, Best language for data science, best language for statistical analysis

So why is it that R is becoming such a popular and useful tool in data analysis and statistical analysis. I'm going to tell you why stay tuned now the short answer is, this is one of the rare occasions when something that is free and open-source is in fact better and that this is in my opinion better than the expensive commercially available alternatives that are out there and if you don't believe me just look at the trends.

There are masses of people moving from SPSS to R, from Stata to R, and from SS to R. I don't see anybody moving the other direction. Now R is essentially a programming language and you might find that fact a little bit intimidating or scary but don't and you'll see when I do the little

demonstration at the end of the article that it's not difficult to use. It's relatively intuitive, you can learn it, and there's loads and loads of support out there.

If you need it the importance of using code to it, when you do data analysis. Your analysis is reproducible somebody else can see exactly how it is that you going to the answers. That you have the ability to collaborate with other people and they can look at what you've done and make suggestions or changes or identify mistakes in your analysis. And you can't do that with a point-and-click system and the next reason why using code to do your analysis is important is that not only is your analysis reproducible but it's also repeatable in other words.

If a year from now you have additional data, let us say you had data for 2018 and 2019. You have double the data set; you want to re-run that analysis. You just run your code and everything your data cleaning, your data manipulation and your analysis. All is repeated right there and then at the push of a button now. One of the most exciting things about R is because it's open source. You have people all over the world writing packages and things that you can install and using it.

R deal with very specific data analytic problems and these are free and there are literally thousands of them. Another big advantage of using R is that it has incredible data visualization and graphics capabilities. In fact, in that sense it beats any other package without trouble. It is a slam-dunk nothing comes close right now. I am going to do a short demonstration just to show you that using a program language to do analysis is not difficult. it's not scary it's relatively simple okay so watch this so in this particular example I have got a little data frame called friends I click on that and we can see it over here we've got some variables and some observations I'm looking at age and height.

Let us see what we can do with those the way the coding works is you apply a function to an object. So in this case the function might be the mean, we want to know the mean of age [written as mean (age)], we might want to know the median of the height [written as median (height)], we can plot a histogram of the age [written as hist (age)] or plot age against height [written as plot (age ~ height)] and we might want to know if there is a statistically significant correlation between age and height [written as cor.test(age,height)] and as we can see in this particular case. Clearly writing code is not that scary at all. I mean I have not broken up into a sweat you will notice that I do not have a tremor. I do not have a heart palpitation. I have not fallen over dead. I have actually survived. It is not difficult. It is not scary.

You can do it now if you're interested in learning R and why don't you subscribe to this website to follow the articles. Thanks for reading, leave your comments and questions in the discussion section below.

Statistical Analysis

Choosing Which Statistical Test to Use.

April 26, 2020
By Sanjeev Dubey
0 Comments

You can use many different tests in statistics. Sometimes it can be quite difficult to know which the correct test to use is. This article will talk about seven tests you are likely to use. Involving means, proportions, and relationships. When you are trying to work out which is the most appropriate test.

You should ask three questions:

1. What level of measurement was used for the data we are analyzing?

2. How many samples do we have?

3. What is the purpose of our analysis?

I will now explain each of these questions

1). Data or level of measurement: Is our data nominal or interval/ratio?

Nominal data is also called categorical, qualitative or nonparametric. Examples of nominal data are color whether parts are defective or not, or preferred type of chocolate. Nominal summary values are usually stated as frequencies, proportions or percentages. The tests that involve nominal data are: Test for a proportion, Difference of two proportions and chi-squared test for independence. The other type of data is interval/ratio, also called quantitative. Examples of interval/ratio data are daily sales figures for coconut ties weight of peanuts or temperature. The most common summary value for interval/ratio data is a mean. Tests that involve interval/ratio data are: Test for a mean difference of two means - independent samples difference of two means – paired and regression analysis. Ordinal data can be classified with nominal or interval/ratio depending on the circumstances.

2). Samples: Next, we ask how many samples are involved.

Is there one sample for which we are testing the relevant statistic against a hypothesized value or are there two samples which are being compared with each other or is the one sample but each observation has a measure or score for more than one variable? The same sample is measured twice. If we wish to compare a proportion or a mean against a given value, this will involve one sample. If we're comparing two different lots of people or things such as men and women or people from two different departments then we would have two samples. If we have two sets of information on the same people of things, we would say we have one sample with two variables. An example is one set of days and information on how many coconut ties are sold and what the temperature was. Alternatively - one set of people and information on their gender and preferred type of chocolate.

3). Finally, we ask. What is the purpose of the analysis?

We can be tested against the hypothesized value comparing two statistics or looking for a relationship. Chi-squared test for independence and regression are similar in that they are looking at the relationship between two variables. The difference between them is in the kind of data. If you would summarize the data in a table, we would use a chi-squared test for independence whereas if you would put it on a scatter plot you would use regression analysis. Here is an example for each of these tests. They relate back or out other articles teaching about hypothesis testing. After each description of the scenario pause the article and see if you can identify the correct test before we tell you the answer. Helen is still selling coconut ties.

Example 1:

Sufficient nuts: Helen was concerned whether the number of nuts was sufficient in her coconut ties. She took a sample of twenty packets and found the weight of nuts in each packet.

1. Data: The weight was interval/ratio data.

2. Samples: There was just one sample of twenty packets of coconut ties.

3. Purpose: Helen was comparing against a given value

Thus, the test she needs to use is Test for a “Mean”.

Example 2: Prize tickets

In a promotional campaign, twenty percent of all packs of coconut ties should include tickets for free prizes. Helen takes a sample of fifty packets and finds that seven of them have winning tickets

1). Data: For each bar we are saying yes or no, only to be lumped whether or not there is a ticket. We get a sample proportion of seven out of fifty from this nominal data.

2). Samples: There is one sample of fifty packets

3).Purpose: Helen is comparing the sample value against a given value: twenty percent.

We conclude that the test she needs to use is tested for a “Proportion”.

Example 3: Bar longevity compared with nut-bars.

Helen thinks her coconut ties last longer than the competition, nut-bars. She gets 36 people to eat one of each and records their eating times.

1). Data: Helen collects time taken in seconds so this is interval/ratio data.

2). Samples: There is one sample of thirty-six people but with two scores for each person the time for the coconut tie and the time for the nutbar.

3). Purpose: She is looking at whether there is a difference in the amount of time taken for each of the bars.

Thus the test is “difference of two means, paired sample”.

Example 4: Defective wrapping from two wrapping machines

Helen thinks there is a difference in performance between the two wrapping machines in her factory. She checks 200 bars from one machine and 150 bars from the other. For each bar, she is seeing if the wrapping is satisfactory or not. She finds that ten out of two hundred bars from the first machine and nine out of 150 bars from the second machine are badly wrapped.

1). Data: The information for each bar is OK or not ok. This is nominal data. It has been summarized as frequencies.

2). Samples: there are two independent samples one sample from each of the two machines

3). Purpose: Helen is comparing the proportions of the two samples.

We can see that the test is the “difference of two proportions”.

Example 5: Do stickers help sales?

Helen is exploring whether having free stickers makes a difference in sales. She has the sales figures for thirteen days when she did offer free stickers and ten days when she did not.

1). Data: For each day Helen has a number or value corresponding to the sales for that day. This is interval/ratio data. It is summarized as a mean member of sales.

2). Samples: There are two samples one sample for days with stickers and one sample for days without.

3). Purpose: Helen is comparing the average sales figures for the two treatments.

We conclude that the test to use is...”Difference of two means independent samples”.

Example 6: Are sales affected by temperature?

Helen wants to see if there is a relationship between the daily temperature and sales of coconut ties. She has data on sales and temperature for thirty weekdays of sales.

1). Data: Sales and temperature at both interval variables.

2). Samples: There is one sample of thirty days with two measures or scores for each day.

3). Purpose: Helen is interested in the relationship between sales and temperature.

This leads us to decide that the test is “Regression”.

Example 7: Men, women, and chocolate preference.

Helen is thinking of selling dark chocolate, milk chocolate, and white chocolate coconut ties. She thinks that men and women might have different preferences with regard to type. She collects data from fifty customers, noting down if they are men or women and asking them which variety they prefer.

1). Data: Helen records the type of chocolate and sex of a person. These are both nominal variables.

2). Samples: There is one sample of fifty customers but with two measures or variables.

3). Purpose: Helen is looking at whether there is a relationship variable.

Thus the test is “chi-squared test for independence”.

Those are seven examples of the seven tests outlined here. There are numerous other statistical tests and other things may need to be considered, but this summary will help you to understand what these seven basic tests do and what to look for when deciding on which test to choose.

Statistical Analysis

Is Data Analysis involves Statistical Process?

April 26, 2020
By Sanjeev Dubey
0 Comments

Statistics, Statistical Modelling,Statistical Analysis

We know what statistics is, I want to spend some time talking about the process of doing statistical analysis. The first step in doing any kind of statistical reporting is determining what it is you want to know. This may seem like a trivial step but it's actually one that professional pollsters and researchers invest a lot of time in.

Pollsters, for example, want to make sure that their questions don't have any bias, that is that they don't accidentally suggest an answer. For example, a question like, "If you knew that Senator Jones had been indicted for tax evasion, would you be more or less likely to vote for him?"

The first part of the question establishes the connection between the senator and crime in the minds of the person being asked the question that's going to indirectly influence their opinion and therefore their answer.

The second step in the process is determining the source of your data.

It often isn't possible to survey an entire population, even small states, for example, have populations in the hundreds of thousands, so you'll need to limit your research to a sample or subset of the whole population.

Coming up with a method for getting a sample that represents an entire population is the second major hurdle faced by statistical researchers. This is another area where they have to take great care to avoid introducing any kind of bias in the selection process. For example, if you wanted to get information about voter preferences in Michigan, it would be easy just to take your sample from people living in cities because they're easy to get to. The problem with that approach is that the opinions of people living in rural areas outside of large cities may be very different from the opinions of people living in large cities and those opinions with representing your results. Once you've collected your data, you're ready to start analyzing it.

What I've referred to here as "exploring" the data can mean a lot of different things. You could sort the data and look for outliers, that is to say, values that are unusually large or small. You can look to see if the values are evenly distributed or if they form groups or clumps. You could try creating graphs from the data to get a more visual perspective on the results. There are a variety of things you can do here, many of which we'll be talking about in later lectures. Now that we have a basic understanding of the data, we're ready to make specific choices about how the results should best be presented. That might be by looking at statistics like the average and standard deviation, with a chart or graph, or it might be with something called a frequency distribution.

At this point, the direction you go is going to be determined by a combination of the researcher's experience presenting results and the expectations of the people who are going to actually be using those results. Finally, you'll want to decide whether or not the results are statistically significant. For example, your sample might tell you that 51% of the voters prefer Candidate A over Candidate B but is that result significant? Do you have enough confidence in the results to say that Candidate A really is the people's preference?" Confidence" may sound like a kind of vague word but there are statistical techniques that will let us express our "level of confidence" in very precise, numeric ways.

Now that we have this process in place, you should keep in mind that it's going to be used in different ways in different situations. For example, a financial analyst may want to see the average daily sales for a company's office for the last month. The analyst has to define a specific question, i.e. "What are the office's average daily sales?" but the sampling method is already determined by the question - she's going to want all of the sales figures for the specific office. There really isn't any need to explore the data because the business has defined which numbers they're interested in. The managers of this company have decided that average daily sales are a number that's important to them and that's what they want to see in their reporting.

In business situations, you'll often omit the last step where you look for statistical significance. Corporate managers and executives will usually apply their personal judgment and experience in interpreting numeric results rather than trying to use specific statistical techniques. Now that we have a process in place, we're ready to start looking at the ways that the specific steps are implemented starting with determining our population and a sample. Thank You.

Pages

How to make your career in Big Data Tools?

Big Data Tools and Technologies

What is Big Data?

R vs Python. What to learn in 2020?

R Programming for Beginners. Why you should use R?

Choosing Which Statistical Test to Use.

Is Data Analysis involves Statistical Process?

About me

Popular Posts

Find us on Facebook

Pinterest

Labels

recent posts

Blog Archive

Instagram

Follow me

Socialize!