The allure of bliss returns once more, calling me to bed to warmly snore.
Data refers to the values generated by things of interest, be they people, events, or objects. For example, the text of posts, numerical values, pictures, and videos are all important data when discussing social networking sites.
Features, on the other hand, are data that have been converted, through domain knowledge, mathematical functions, etc., into representative characteristics that present the traits of the things of interest more concretely.
Computers only understand numbers and can only perform calculations based on them. Therefore, all information from a human perspective needs to be converted into numbers.
Generating features therefore means converting different types of data (text, pictures, etc.) into numerical values through various methods.
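As a rough illustration of turning non-numeric data into numbers, the sketch below derives three simple numeric features from a piece of text. The sample text and the feature names (char_count, word_count, has_digit) are made up for this example; real projects usually pick features based on domain knowledge.

```python
# Hypothetical post text; the chosen features are illustrative only.
post = "Feature engineering turns raw data into numbers"

features = {
    # Length of the text in characters
    "char_count": len(post),
    # Number of whitespace-separated words
    "word_count": len(post.split()),
    # 1 if the text contains any digit, otherwise 0
    "has_digit": int(any(ch.isdigit() for ch in post)),
}

print(features)
```

Each value in the dictionary is a number, so a computer can compare posts or feed them into calculations.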
Continuous feature: variables on which mathematical operations can be performed. Includes equidistance and equiproportional variables.
Equidistance variable: a variable in a data set that changes by the same value in each step along the range. An example of this is temperature in degrees Celsius (or Fahrenheit), as the difference between 10°C and 11°C is the same as the difference between 20°C and 21°C (1°C).
Equiproportional variable: a variable that has a proportional change, rather than a constant difference, in each step along the range. An example of this is the exponential decay of radioactive materials, which lose a constant proportion of their radioactivity over each time period, not a constant amount.
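The contrast between the two kinds of variable can be sketched with a little arithmetic. The temperature values come from the example above; the starting activity of 1000 and the remaining fraction of 0.5 per period are assumed numbers chosen only to keep the output easy to read.

```python
# Equidistance: every step changes by the same amount.
celsius = [10, 11, 20, 21]
step_low = celsius[1] - celsius[0]    # difference between 10°C and 11°C
step_high = celsius[3] - celsius[2]   # difference between 20°C and 21°C

# Equiproportional: every step keeps the same proportion.
initial_activity = 1000.0   # hypothetical starting value
remaining_fraction = 0.5    # assumed: half remains after each period
activity = [initial_activity * remaining_fraction ** n for n in range(4)]

# Ratios between consecutive steps are constant...
ratios = [activity[i + 1] / activity[i] for i in range(3)]
# ...while the amounts lost per step are not.
losses = [activity[i] - activity[i + 1] for i in range(3)]

print(step_low, step_high)  # 1 1
print(activity)             # [1000.0, 500.0, 250.0, 125.0]
print(ratios)               # [0.5, 0.5, 0.5]
print(losses)               # [500.0, 250.0, 125.0]
```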
Categorical feature: variables on which mathematical operations cannot meaningfully be performed. Includes nominal and ordinal variables.
Nominal variable: a categorical variable that simply names or labels different groups or classifications, with no inherent order between the categories. An example of this is sorting objects by color without implying any one is better or worse than another.
Ordinal variable: a categorical variable that has a natural order or ranking among the categories. An example of this is ranking students in class – 1st, 2nd, 3rd – where the order is clear.
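Since categorical values still have to become numbers, here is a minimal sketch of one common way to encode each kind with pandas. The column names and values are hypothetical, and one-hot encoding and ordered category codes are only two of several possible encodings.

```python
import pandas as pd

# Hypothetical data: "color" is nominal, "rank" is ordinal.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "rank": ["2nd", "1st", "3rd", "1st"],
})

# Nominal: one-hot encoding gives each color its own column,
# so no order between colors is implied.
color_dummies = pd.get_dummies(df["color"], prefix="color")

# Ordinal: declare the order explicitly, then use the integer codes,
# which preserve that order (1st -> 0, 2nd -> 1, 3rd -> 2).
rank_order = ["1st", "2nd", "3rd"]
rank_codes = pd.Categorical(df["rank"], categories=rank_order, ordered=True).codes

print(color_dummies)
print(list(rank_codes))  # [1, 0, 2, 0]
```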
A DataFrame is the data structure that the pandas library for Python uses to process tabular data, with functions similar to Excel. It is mainly composed of rows and columns. In the table, their labels are called the "Index" and the "Column Name" respectively.
Row: a set of info read horizontally.
Column: a set of info read vertically.
Index: the label that identifies each row in the data table (usually unique).
Column Name: the name of each column in the data table.
In Python, if a single column is extracted from a DataFrame, it is presented in the form of a Series. In a Series, each piece of data still has its own index value. When a Series is printed, you can also see its name and the data type of its contents.
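The terms above can be seen in a small example. The table contents and the index labels u1, u2, u3 are made up for illustration.

```python
import pandas as pd

# Hypothetical table with custom row labels as the index.
df = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cho"], "age": [31, 25, 40]},
    index=["u1", "u2", "u3"],
)

print(df.index.tolist())    # row labels (Index)
print(df.columns.tolist())  # column names

# Extracting one column gives a Series that keeps the same index.
ages = df["age"]
print(type(ages).__name__)  # Series
print(ages)                 # printing also shows the name and dtype
```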
I have practiced creating 1D and 2D lists before, but have not yet tried converting them into DataFrames. For the following code to work, install and import the pandas module.
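A minimal sketch of the conversion, assuming pandas is already installed (for example with pip install pandas). The list contents and column names are hypothetical.

```python
import pandas as pd

# A 1D list becomes a Series; a 2D list (a list of rows) becomes a DataFrame.
scores_1d = [88, 92, 79]
table_2d = [["Ann", 88], ["Ben", 92], ["Cho", 79]]

scores = pd.Series(scores_1d, name="score")
table = pd.DataFrame(table_2d, columns=["name", "score"])

# With no index given, pandas numbers the rows 0, 1, 2, ...
print(scores)
print(table)
```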
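In the context of computer performance, a Series is a 1D labeled array, whereas a DataFrame is a 2D table. Once a column has been extracted as a Series, a value can be looked up with a single label instead of a row and a column, which tends to be simpler than going through the whole DataFrame. A plain Python list is also 1D, and indexing it by position is generally very fast, so the comparison that matters here is Series versus DataFrame rather than Series versus list.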
.loc is used to access a group of rows and columns by labels or a boolean array. When you want to select more than one label, pass them as a list.
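The following sketch shows the main ways of using .loc on a small, made-up table.

```python
import pandas as pd

# Hypothetical table with string labels as the index.
df = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cho"], "age": [31, 25, 40], "city": ["Oslo", "Lima", "Seoul"]},
    index=["u1", "u2", "u3"],
)

one_row = df.loc["u1"]                          # single label -> Series
several_rows = df.loc[["u1", "u3"]]             # list of labels -> DataFrame
cell = df.loc["u2", "age"]                      # row label and column label -> single value
subset = df.loc[["u1", "u3"], ["name", "city"]] # lists for both rows and columns
over_30 = df.loc[df["age"] > 30]                # boolean array keeps rows where it is True

print(several_rows)
print(over_30)
```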
Whereas .loc selects data by label, .iloc selects data by integer position. .loc is more flexible, since labels can be changed and still point at the same rows, while .iloc can be slightly faster because it skips the label lookup. In practice the difference is usually small.
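For comparison, here is a sketch of .iloc on a made-up table. One detail worth noticing is that position slices leave out the end position, while label slices with .loc include the end label.

```python
import pandas as pd

# Hypothetical table; positions are 0, 1, 2 regardless of the labels.
df = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cho"], "age": [31, 25, 40]},
    index=["u1", "u2", "u3"],
)

first_row = df.iloc[0]             # row at position 0 -> Series
picked_rows = df.iloc[[0, 2]]      # list of positions -> DataFrame
cell = df.iloc[1, 1]               # row position 1, column position 1

by_position = df.iloc[0:2]         # positions 0 and 1 (end excluded)
by_label = df.loc["u1":"u2"]       # labels u1 through u2 (end included)

print(by_position)
print(by_label)
```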
The pandas module's read_csv() function reads the CSV file given inside the parentheses and returns it as a DataFrame. Be mindful of the following parameters when reading CSV files:
File path or buffer: the path to the file on the computer. Pay attention to issues of relative and absolute paths.
sep: the character used to separate the values in the file. Most CSV files use "," to separate data, which is also the default, so this parameter usually does not need to be specified.
header: the row in the CSV file that holds the column names. In most CSV files this is the top row (position 0), which is also the default, so this parameter usually does not need to be specified.
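A small sketch of these parameters in use. To keep it runnable without a file on disk, the CSV contents are held in strings and wrapped in io.StringIO (the "buffer" case); with a real file you would pass its path instead. The sample data is made up.

```python
import io
import pandas as pd

# Defaults: comma-separated, column names in the first row.
csv_text = "name,age\nAnn,31\nBen,25\n"
df_default = pd.read_csv(io.StringIO(csv_text))

# A different separator has to be named with sep.
semicolon_text = "name;age\nAnn;31\nBen;25\n"
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# A file with no header row: header=None stops pandas from treating
# the first data row as column names; names supplies them instead.
no_header_text = "Ann,31\nBen,25\n"
df_no_header = pd.read_csv(
    io.StringIO(no_header_text), header=None, names=["name", "age"]
)

print(df_default)
print(df_no_header)
```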
If you want to read Excel files, use read_excel() instead of read_csv().