BIG DATA MATERIALS FOR BCOM CA
Big data refers to extremely massive and intricate data collections that are difficult to manage, handle, or analyse using conventional data processing methods. It includes information that falls under the "Three V's" of volume, velocity, and variety.
Volume: The term "big data" describes data sets that are too large to be stored and processed efficiently by traditional database systems. These data collections can run to terabytes, petabytes, or even exabytes.
Velocity: Big data is generated continuously and at high speed, and it enters businesses from a variety of sources, including social media, sensors, online transactions, and more. Because the data is produced in real time or very close to real time, it must be processed and analysed quickly.
Variety: Big data comes in many formats and types, including structured, semi-structured, and unstructured data. It contains text, pictures, videos, audio, social media posts, log files, and other things. Storing, handling, and interpreting this variety of data efficiently is difficult.
Due to technological advancements and the increasing digitalization of numerous businesses, big data has become increasingly important in recent years. Businesses in a variety of industries, including manufacturing, banking, healthcare, retail, and telecommunications, are utilising big data to gather insightful information, make informed decisions, increase operational effectiveness, and improve consumer experiences.
Organisations use specialised hardware and software, including distributed computing frameworks like Apache Hadoop and Apache Spark, NoSQL databases, data lakes, data warehouses, and advanced analytics methods like machine learning and artificial intelligence, to manage big data successfully.
Big data has enormous promise, but it's vital to remember that there are drawbacks as well. These difficulties include issues with data quality and veracity, data privacy and security, data governance, scalability, and the requirement for qualified data specialists to draw valuable conclusions from the large amount of data at hand.
THE FOLLOWING CONTENTS ARE AVAILABLE IN THE NOTES
Introduction to Big data
Data, classification of digital data (structured, unstructured, semi-structured), characteristics of data, evolution of big data, definition and challenges of big data, what is big data and why use big data, business intelligence vs big data.
1. Data:
In the pursuit of knowledge, data is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted. A datum is an individual value in a collection of data.
Digital Data Classification: The process of classifying data into relevant categories so that it can be used or applied more efficiently. Classification makes it easy for the user to retrieve data. It is important for data security and compliance, and for meeting different business or personal objectives. It is also a major requirement because data must often be retrievable within a specific period of time.
2. Types of Digital Data Classification:
Data can be broadly classified into 3 types.
1. Structured Data:
Structured data is created using a fixed schema and is maintained in tabular format. The elements in structured data are addressable for effective analysis. It covers all data that can be stored in an SQL database as tables of rows and columns. Today, a great deal of data is created and processed in this form because it is the simplest way to manage information.
Examples –
Relational data, Geo-location, credit card numbers, addresses, etc.
Consider an example of relational data: a university has to maintain a record of its students, such as the name, ID, address, and email of each student. The following relational schema and table can be used to store the student records.
S_ID    S_Name    S_Address    S_Email
1001    A         Delhi        A@gmail.com
1002    B         Mumbai       B@gmail.com
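As a minimal sketch, the student table above could be created and queried with Python's built-in sqlite3 module. The table name `student`, the column types, and the in-memory database are assumptions made for illustration; the notes only specify the four column names and the two rows.

```python
import sqlite3

# In-memory database; the table name "student" and the column types
# are assumptions, only the column names come from the notes.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        S_ID      INTEGER PRIMARY KEY,
        S_Name    TEXT NOT NULL,
        S_Address TEXT,
        S_Email   TEXT
    )
    """
)

# The two rows shown in the table above.
rows = [
    (1001, "A", "Delhi", "A@gmail.com"),
    (1002, "B", "Mumbai", "B@gmail.com"),
]
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Because the schema is fixed, every field is addressable by column name.
result = conn.execute(
    "SELECT S_Name, S_Email FROM student WHERE S_Address = ?", ("Delhi",)
).fetchall()
print(result)  # [('A', 'A@gmail.com')]
```

The fixed schema is what makes the query possible: each value sits in a named column, so it can be filtered and selected directly.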
2. Unstructured Data:
Unstructured data is data that does not follow a pre-defined standard, that is, it does not follow any organized format. This kind of data is not a good fit for a relational database, because a relational database expects data in a pre-defined, organized form. Unstructured data is nevertheless very important in the big data domain, and there are many platforms for managing and storing it, such as NoSQL databases.
Examples –
Word, PDF, text, media logs, etc.
3. Semi-Structured Data:
Semi-structured data is information that does not reside in a relational database but that has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database, although this is very hard for some kinds of semi-structured data. The semi-structured form exists partly to save space.
Example –
XML data.
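The organizational properties of XML can be illustrated with a short sketch that uses Python's standard xml.etree.ElementTree module. The XML document, its tag names, and the `phone` field are hypothetical; they are chosen only to show that records carry their own tags and need not share the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: the tags give it structure, but the two
# records do not have the same set of fields.
xml_text = """
<students>
  <student id="1001">
    <name>A</name>
    <address>Delhi</address>
    <email>A@gmail.com</email>
  </student>
  <student id="1002">
    <name>B</name>
    <phone>hypothetical-number</phone>
  </student>
</students>
""".strip()

root = ET.fromstring(xml_text)

# Turn each <student> element into a dictionary of whatever fields it has.
records = []
for student in root.findall("student"):
    record = {"id": student.get("id")}
    for field in student:
        record[field.tag] = field.text
    records.append(record)

for record in records:
    print(record)
```

The second record has no address or email and adds a phone field, which a fixed relational schema would not allow without changing the table.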
Features of Data Classification:
The main goal of organizing data is to arrange it in such a form that it becomes readily available to the users. Its basic features are as follows.
• Homogeneity – The data items in a particular group should be similar to each other.
• Clarity – There must be no confusion in the positioning of any data item in a particular group.
• Stability – The set of data items must be stable, i.e. repeated investigations should not change the classification.
• Elasticity – One should be able to change the basis of classification as the purpose of classification changes.
3. Five Characteristics Of Good Quality Data!
One of the most important things to remember is that not all data is of good quality, and poor-quality data is of limited use. To fully realize the benefits of data, it has to be of high quality. This means looking for certain characteristics in the data. These are:
1. Data should be precise, which means it should contain accurate information. Precision saves the user both time and money.
2. Data should be relevant and meet the requirements of the user. Hence the legitimacy of the data should be checked before it is used.
3. Data should be consistent and reliable. False data is worse than incomplete data or no data at all.
4. Data should be complete as well as relevant in order to be of good quality and useful. In today's world of dynamic data, no set of relevant information is complete at all times; at the time of its usage, however, the data has to be comprehensive and complete in its current form.
5. High-quality data is unique to the requirement of the user. Moreover, it is easily accessible and can be processed further with ease.
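Some of these characteristics can be checked mechanically. The following is a rough sketch, not a standard method: it counts incomplete records (a required field is empty) and duplicate records (an ID appears more than once). The function name `quality_report`, the list of required fields, and the sample records are all assumptions made for illustration.

```python
# Assumed required columns, borrowed from the student table example.
REQUIRED_FIELDS = ("S_ID", "S_Name", "S_Email")


def quality_report(records):
    """Count incomplete and duplicate records in a list of dictionaries."""
    seen_ids = set()
    incomplete = 0
    duplicates = 0
    for rec in records:
        # Incomplete: any required field is missing or empty.
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            incomplete += 1
        # Duplicate: the same ID has already been seen.
        if rec.get("S_ID") in seen_ids:
            duplicates += 1
        seen_ids.add(rec.get("S_ID"))
    return {
        "total": len(records),
        "incomplete": incomplete,
        "duplicates": duplicates,
    }


# Hypothetical sample data with one empty email and one repeated ID.
sample = [
    {"S_ID": 1001, "S_Name": "A", "S_Email": "A@gmail.com"},
    {"S_ID": 1002, "S_Name": "B", "S_Email": ""},
    {"S_ID": 1001, "S_Name": "A", "S_Email": "A@gmail.com"},
]
report = quality_report(sample)
print(report)  # {'total': 3, 'incomplete': 1, 'duplicates': 1}
```

Checks like precision and relevance depend on the user's requirement and cannot be reduced to a simple count in the same way.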
4. What is big data?
Big data refers to data that is so large and complex that traditional methods of collection and analysis cannot handle it. The amount and variety of big data have increased exponentially over the past decade.
Data that is very large in size is called big data. Normally we work with data measured in megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code), but data measured in petabytes, i.e. 10^15 bytes, is called big data. It is often stated that almost 90% of today's data has been generated in the past 3 years.
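To get a feel for these sizes, the short sketch below converts between units. It assumes decimal (SI) units, where each step is a factor of 1000, which matches the figure of 10^15 bytes for a petabyte; the helper name `to_bytes` is hypothetical.

```python
# Decimal (SI) storage units, assumed here: each step is a factor of 1000.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def to_bytes(value, unit):
    """Convert a size in the given unit to bytes."""
    return value * UNITS[unit]


petabyte = to_bytes(1, "PB")
gb_per_pb = petabyte // UNITS["GB"]

print(petabyte)   # 1000000000000000
print(gb_per_pb)  # 1000000
```

In other words, one petabyte equals one million gigabytes.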
Sources of Big Data
These data come from many sources, such as:
• Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
