Arrays and Data Frames
In this lab you will complete some hands-on exercises in Python. In many
situations, when you carry out ML processes, you work with datasets that are
stored in arrays, data frames, or both. Python offers the standard set of data
types, and its extension packages add the data frame, which truly gives it
power. A data frame is an all-in-one combination of a database table, matrix,
2D array, and pivot table, with many additional time-saving features.
Much like a database table, each column in a data frame has a column name and
holds elements of the same data type. You can perform operations on whole
columns, rows, or subsets of each: adding, merging, flattening, expanding,
changing, deleting, and searching for data are all one-line operations. There
are also methods to read and write the contents of data frames to and from
files. In essence, Python achieves this expressive power by putting
intelligence into the data structure and the functions that operate on it. In
contrast, many other programming languages have less sophisticated built-in
data structures, meaning you need to write your own code and create your own
data structures to achieve similar results.
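As a minimal sketch of such one-line operations, the following uses the pandas
package (described in Section B) on a small made-up table. The column names and
values are illustrative assumptions and are not part of the lab data.

```python
import io
import pandas as pd

# Hypothetical toy table; names and values are illustrative only.
orders = pd.DataFrame({
    "item": ["pen", "book", "lamp"],
    "price": [1.5, 12.0, 30.0],
    "quantity": [10, 2, 1],
})

# Operate on whole columns: build a new column from two existing ones.
orders["total"] = orders["price"] * orders["quantity"]

# Search: keep only the rows whose total exceeds 20.
large = orders[orders["total"] > 20]

# Delete a column (returns a new data frame, the original is unchanged).
trimmed = orders.drop(columns=["quantity"])

# Write to CSV text and read it back; a file path works the same way.
buffer = io.StringIO()
orders.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(large)
print(restored.shape)
```

Each step is a single statement, with no explicit loop over rows.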
The data frame is the core data structure you will find yourself using for most
data analytics projects. It lets you focus on what you want to do with the data
rather than how to do it. Arrays are also important, and you need to know how
to create and manipulate them. At times, you will want to create a data frame
from an array.
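One way to create a data frame from an array is sketched below, assuming the
NumPy and pandas packages covered in Sections A and B. The array values and the
column names "height" and "width" are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 3x2 array of made-up measurements (illustrative values only).
measurements = np.array([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])

# Wrap the array in a data frame; the column names are hypothetical.
frame = pd.DataFrame(measurements, columns=["height", "width"])

# Columns can now be addressed by name instead of by position.
print(frame["height"].mean())

# Convert back to a plain array when needed.
back = frame.to_numpy()
```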
Section A: NumPy Arrays
NumPy is the Python extension package for multi-dimensional arrays. It is
designed for scientific computation and is very efficient. Launch Spyder and
then use the console to complete sections 1.3.1 to 1.3.3 at:
The same practical exercises can be found on pages 47 to 76 of the ‘Scipy
Lecture Notes’ book, attached as a PDF in the ‘CourseResources’ section of
Canvas.
You do not have to complete every step. If you understand how a feature works
or is used, skip it and move to the next step.
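For orientation, here is a short sketch of the kind of array operations you can
try in the console. It is a small illustrative selection, not a reproduction of
the exercises referenced above.

```python
import numpy as np

# Create a 1-D array and reshape it into a 2-D array.
a = np.arange(6)        # values 0 through 5
b = a.reshape(2, 3)     # 2 rows, 3 columns

# Inspect the shape and the element type.
print(b.shape, b.dtype)

# Slicing: the second column of every row.
col = b[:, 1]

# Element-wise arithmetic without an explicit loop.
doubled = b * 2

# Reduction along an axis: the sum of each column.
col_sums = b.sum(axis=0)
```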
Section B: The Pandas Data Frame
Some steps of the hands-on exercise from this section will be submitted
for grading.
Data frames in Python are implemented in the Pandas extension package. A data
frame contains named columns, and each column can hold a different data type,
which is what we need because data samples have features of different data
types.
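The following sketch shows a data frame whose columns hold different data
types. The feature names and values are hypothetical and chosen only to
illustrate the point.

```python
import pandas as pd

# Hypothetical samples; the feature names and values are illustrative.
samples = pd.DataFrame({
    "age_days": [30, 725, 12],            # integers
    "amount": [19.99, 5.0, 240.5],        # floating point
    "method": ["card", "card", "cash"],   # text
    "flagged": [False, False, True],      # booleans
})

# Each column reports its own data type.
print(samples.dtypes)

# Select only the numeric columns, e.g. before computing statistics.
numeric = samples.select_dtypes(include="number")
print(list(numeric.columns))
```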
The file payment_fraud.csv contains comma-separated data samples of payment
transactions on various dates. The columns of this dataset are:
‘accountAgeDays’, ‘numItems’, ‘localTime’, ‘paymentMethod’,
‘paymentMethodAgeDays’, and ‘label’. The last column, ‘label’, indicates
whether a transaction is fraudulent (1) or not (0). This dataset is used in the
ML classification example in chapter 2 of the course textbook (page 27).
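A file with this layout can be loaded roughly as follows. To keep the sketch
self-contained it reads from an in-memory stand-in: the header uses the column
names listed above, but the three rows (including the values ‘storecredit’ and
‘creditcard’) are invented for illustration and are not taken from the real
file.

```python
import io
import pandas as pd

# Stand-in for payment_fraud.csv: real column names, invented rows.
csv_text = """accountAgeDays,numItems,localTime,paymentMethod,paymentMethodAgeDays,label
29,1,4.75,paypal,28.2,0
725,1,4.74,storecredit,0.0,0
3,5,4.92,creditcard,0.0,1
"""

# With the real file you would pass its path instead of a StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                      # (rows, columns)
print(df.columns.tolist())           # column names
print(df["label"].value_counts())    # number of samples per class
```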
We will use this dataset to learn various ways in which you can manipulate data
in a data frame. Use the Python code in the file DataFrameExample2.py to
explore this data. First copy the file payment_fraud.csv into your current
working directory, or include the full path of the file in line 12 of the
script. You should run the code on each line individually, starting at line 7.
Feel free to explore this dataset using Python in other ways. You should also
open payment_fraud.csv in Excel and, after you execute each line of code,
compare the results you get with the contents of the Excel spreadsheet.
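If you want to explore beyond the provided script, the sketch below shows a few
common ways to inspect a data frame. It does not reproduce
DataFrameExample2.py, and the rows are invented; only the column names come
from the dataset description.

```python
import pandas as pd

# Invented rows using column names from payment_fraud.csv; with the real
# data you would load the file with pd.read_csv instead.
df = pd.DataFrame({
    "accountAgeDays": [29, 725, 3, 845],
    "numItems": [1, 1, 5, 2],
    "label": [0, 0, 1, 0],
})

first_two = df.head(2)      # first two rows
row0 = df.iloc[0]           # one row, selected by position
items = df["numItems"]      # one column, selected by name

# Boolean mask: rows with more than one item, keeping two columns.
multi = df.loc[df["numItems"] > 1, ["numItems", "label"]]

# Sort by account age, youngest accounts first.
by_age = df.sort_values("accountAgeDays")

print(multi)
```

Each result can be checked against the corresponding rows and columns in the
spreadsheet view of the file.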
1. Copy and paste the output you get from lines 19, 22, 29, 47, 55, 68, and 73
into a Word document.
2. Provide code that returns all samples whose payment method is ‘paypal’.
3. Provide code that returns the fraudulent transaction samples whose payment
method is ‘paypal’.
Submit the Word document
containing your answers to 1, 2, and 3 into the dropbox for lab 2.