Any data analyst who uses computer programs or coding languages should know the different data types that computers recognize. The distinctions matter because each function works only with certain data types. For example, you can use a SUM function to add two numbers together but not to find the sum of two text strings (say, “cat” and “dog”).
Data type classification sometimes varies between programming languages. For example, Java and C++ distinguish between the Float and Double data types, while Python has a single float type that fills the role of Double. Below we’ll cover the major data types used in programming and analytics and discuss differences across languages.
Primitive vs Composite Data Types
Before we dive into the specific data types, it’s important to understand the difference between primitive data types and composite data types. Primitive data types are built into a programming language with no extra definition needed on the part of the programmer. Composite data types, on the other hand, are formed using primitive data types as building blocks. For example, a List is a composite data type that contains multiple elements of data. You can have a list of integers, a list of text strings, or even a list of lists! That is to say, composite data types can be formed from primitive data type elements or other composite data type elements.
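As a minimal sketch of this idea in Python (the variable names are illustrative only, and Python itself draws the primitive/composite line less sharply than Java or C), simple values can be combined into lists, and lists into lists of lists:

```python
# Simple values: a whole number and a text string.
count = 3
name = "cat"

# Composite values built from simpler ones.
numbers = [1, 2, 3]                 # a list of integers
words = [name, "dog"]               # a list of text strings
nested = [numbers, words, [count]]  # a list of lists

print(type(count).__name__)    # int
print(type(numbers).__name__)  # list
print(nested[1][0])            # cat
```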
A string is made up of multiple characters strung together in a particular order. This data type is often used for sentences, words, and phrases. A string can contain numerical characters, but mathematical functions cannot be performed on it until it is converted to a numeric type. The length of a string is defined as the number of characters in it (including blank spaces). Most programming languages have this data type, though in C it is not built in. Instead, C represents a string as an array of characters.
“The quick brown fox jumps over the lazy dog.”
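The following Python sketch illustrates the points above using that example sentence. It counts the characters, spaces included, and shows that a string of digits stays text until it is explicitly converted. The variable names are assumptions chosen for illustration.

```python
sentence = "The quick brown fox jumps over the lazy dog."

# Length counts every character, including spaces and the period.
print(len(sentence))  # 44

# A string of numerical characters is still text:
digits = "42"
print(digits + "1")     # 421 (joins the text, no arithmetic)
print(int(digits) + 1)  # 43  (arithmetic after converting to an integer)
```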
The integer data type is a primitive data type available in most programming languages. An integer is a positive or negative whole number (or zero). Some languages limit how large or small a number can be represented as an integer. In Java, for example, only whole numbers between -2,147,483,648 and 2,147,483,647 can be represented with this data type. Numbers outside of this range must be represented with another data type. In contrast, Python accepts any whole number as an integer, limited only by available memory.
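To make the range concrete, here is a small Python sketch. The helper `fits_in_java_int` is a hypothetical name introduced for this example, not part of any library. It checks a value against the Java limits quoted above, then shows that Python itself has no such ceiling.

```python
# Limits of Java's 32-bit integer type, as quoted in the text.
JAVA_INT_MIN = -2**31     # -2,147,483,648
JAVA_INT_MAX = 2**31 - 1  #  2,147,483,647

def fits_in_java_int(n: int) -> bool:
    """Return True if n lies within the 32-bit integer range."""
    return JAVA_INT_MIN <= n <= JAVA_INT_MAX

print(fits_in_java_int(JAVA_INT_MAX))      # True
print(fits_in_java_int(JAVA_INT_MAX + 1))  # False

# Python integers simply keep growing.
print(2**100)  # 1267650600228229401496703205376
```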
Like an integer, a long data type is also used for numerical data. It is available in most languages that have a limited range of integers, such as Java and C. In Java, the long data type can represent whole numbers between −9,223,372,036,854,775,808 and +9,223,372,036,854,775,807. It is used mostly for dealing with numbers outside of the range allowed by the integer data type. Long is not used in languages like Python where large numbers can be represented as integers instead.
For integers too large even for a long, many languages provide an arbitrary-precision integer type, such as BigInteger in Java, big.Int in Go, or (in older versions) Bignum in Ruby. C and C++ offer long long, which is guaranteed to be at least 64 bits wide but still has a fixed size. Python needs neither a type like this nor the long data type, since its integer data type can represent any whole number.
Most languages that support long data types also support a data type called short, which is used for whole numbers within a very narrow range: between -32,768 and +32,767 in Java and, typically, in C. Any number expressed as a short can just as easily be expressed as an integer, so this data type is not frequently used. However, a short uses less memory (measured in bytes) than an integer, so it is used in programs that must work with minimal memory.
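Python has no short or long type of its own, but its standard `struct` module can pack numbers into fixed-width formats, which gives a rough way to see the sizes and limits described above. In this sketch the format codes `h`, `i`, and `q` stand in for 16-, 32-, and 64-bit signed integers (matching Java's short, int, and long; the sizes of C's types can vary by platform), and `fits_in_short` is a hypothetical helper.

```python
import struct

# The "<" prefix asks struct for its standard, platform-independent sizes.
SIZES = {
    "short": struct.calcsize("<h"),
    "int": struct.calcsize("<i"),
    "long": struct.calcsize("<q"),
}
print(SIZES)  # {'short': 2, 'int': 4, 'long': 8}

def fits_in_short(n: int) -> bool:
    """Return True if n can be packed as a 16-bit signed integer."""
    try:
        struct.pack("<h", n)
        return True
    except struct.error:
        return False

print(fits_in_short(32767))  # True
print(fits_in_short(32768))  # False
```

The sizes are in bytes, so a short takes half the space of a 32-bit integer and a quarter of the space of a 64-bit long.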
Float, short for “floating point number,” is another data type used for numerical values. Unlike the integer types discussed above, float values can be non-whole numbers. In some languages, the equivalent to a float is called a single, short for “single precision floating point number.”
A double is also a floating point number, but it uses twice as much memory as a float (typically 64 bits rather than 32) and so can be more precise. In most languages, a float can hold about seven significant decimal digits, while a double can hold about fifteen. In Python, there is only one floating point data type, called float, but it has the same level of precision as the double data type used by other languages.
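The difference in precision can be observed from Python by round-tripping a value through 32-bit and 64-bit encodings with the standard `struct` module. This is only an illustrative sketch: the variable names are assumptions, and the value reported by `sys.float_info.dig` holds for the IEEE 754 doubles used on typical platforms.

```python
import struct
import sys

value = 0.1

# Store the value as a 32-bit float ("f") and as a 64-bit double ("d"),
# then read each one back.
as_single = struct.unpack("<f", struct.pack("<f", value))[0]
as_double = struct.unpack("<d", struct.pack("<d", value))[0]

print(as_single)  # 0.10000000149011612 (accurate to about seven digits)
print(as_double)  # 0.1

# Decimal digits a Python float can reliably hold.
print(sys.float_info.dig)  # 15 on typical platforms
```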
A boolean is a true/false data type. It is often used in writing conditional IF expressions. In Python, the boolean type is called bool, and its two values, True and False, behave like the numbers 1 and 0 in arithmetic.
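A short Python sketch of both uses follows: a boolean driving a conditional, and the numeric behavior of True and False. The names `device_on` and `status` are illustrative only.

```python
device_on = True

# A boolean in a conditional expression.
status = "on" if device_on else "off"
print(status)  # on

# In Python, bool values also behave like the integers 1 and 0.
print(isinstance(device_on, int))  # True
print(True + True)                 # 2
print(False == 0)                  # True
```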
An array is a collection of multiple elements. This makes it a composite (rather than primitive) data type. The exact definition of an array does vary slightly between languages. In most cases, elements are ordered by index. In some languages, an array can be either single-dimensional (like a list) or multi-dimensional (like a spreadsheet with rows and columns). C++ uses another data type called a vector for arrays that do not contain a fixed number of elements.
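In Python, the built-in list is the closest everyday counterpart to an array, so the sketch below uses lists as a stand-in (an assumption for illustration; dedicated array types exist in other libraries). It shows index-based access, a two-dimensional layout built from nested lists, and growth in the style of a C++ vector.

```python
# Single-dimensional: elements are ordered by index, starting at 0.
scores = [90, 85, 77]
print(scores[0])  # 90

# Multi-dimensional: rows and columns, like a small spreadsheet.
grid = [
    [1, 2, 3],
    [4, 5, 6],
]
print(grid[1][2])  # 6 (second row, third column)

# Like a C++ vector, a Python list is not fixed in size.
scores.append(60)
print(len(scores))  # 4
```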
Choosing a Data Type
Sometimes it is very obvious which data type should be used for particular applications in programming. For example, the true/false Boolean data type is a clear choice for indicating whether a device is turned on or off. Other times, the choice is not so obvious. For instance, when working with numbers, how do you choose whether to use an integer or a float? Consider the following:
- Will the data type work for all elements? For example, if you are working with a series of numbers that are mostly whole numbers, you may be tempted to use the integer data type. However, if even one or two elements are decimals, you will be better off using a data type that allows for these values, like a float or double.
- What kinds of functions will be used with the data? Most functions require a certain data type as an input. For instance, you cannot use a sum() function on string elements like “two” and “three” but you can use it to add integers like 2 and 3.
- How much memory will the data use? Different data types use different amounts of memory. If you are trying to conserve memory, it is best to use lower-memory data types when possible. This matters most when your program handles vast amounts of data, because the cost of a high-memory data type is multiplied across every element. Generally, more constrained data types use less memory. For instance, a long uses more memory than a short, and a double uses more than a float.
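The three considerations above can be sketched in Python as follows. The sample data is made up for illustration, and the standard `array` module's type codes `h` and `q` are used here as stand-ins for a short and a long (their sizes are 2 and 8 bytes on typical platforms).

```python
from array import array

# 1. Will the type work for all elements? One decimal value means the
#    series needs floats, and the sum comes back as a float.
readings = [2, 3, 4.5]
print(sum(readings))  # 9.5

# 2. What functions will be used? sum() adds numbers, not text.
try:
    sum(["two", "three"])
    summed_strings = True
except TypeError:
    summed_strings = False
print(summed_strings)  # False

# 3. How much memory? The same 1,000 values stored in a narrower
#    type take fewer bytes.
small = array("h", range(1000))
large = array("q", range(1000))
print(small.itemsize * len(small) < large.itemsize * len(large))  # True
```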