Arrays and Data Frames

In this lab you will complete some hands-on exercises in Python. In many situations, when you carry out ML processes, you work with datasets that are stored in arrays and/or data frames. Python offers the standard set of data types, and the Pandas package adds the data frame, which truly gives it power. A data frame is essentially an all-in-one combination of a database table, matrix, 2D array, and pivot table, with many additional time-saving features.

 

Much like a database table, each column in a data frame has a column name and holds elements of the same data type. You can perform operations on whole columns, rows, or subsets of each: adding, merging, flattening, expanding, changing, deleting, and searching for data can all be done with one-line operations. There are also methods to read and write the contents of data frames to and from files. In essence, Python achieves this expressive power by putting intelligence into the data structure and the functions that operate on it. In contrast, other programming languages have less sophisticated data structures, meaning you need to write your own code and create your own data structures to achieve similar results.
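To make the one-line operations described above concrete, here is a minimal sketch using Pandas. The two small tables, their column names, and their values are hypothetical and exist only for illustration; an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical toy data; the names and values are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["ann", "bob", "ann"],
    "amount": [10.0, 25.5, 7.25],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob"],
    "city": ["Oslo", "Lima"],
})

# Adding: derive a new column from a whole existing column.
orders["amount_with_tax"] = orders["amount"] * 1.1

# Searching: keep only the rows that satisfy a condition.
large = orders[orders["amount"] > 9]

# Merging: join two data frames on a shared column, like a database join.
merged = orders.merge(customers, on="customer", how="left")

# Deleting: drop a column that is no longer needed.
trimmed = merged.drop(columns=["amount_with_tax"])

# Writing and reading: round-trip the data frame through CSV text.
buffer = io.StringIO()
trimmed.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Each operation is a single statement, which is the point the paragraph above makes about data frames.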

 

The data frame is the core data structure you will find yourself using for most data analytics projects. It lets you focus on what you want to do with the data rather than how to do it. Arrays are also important, and you need to know how to create and manipulate them. At times, you will want to create a data frame from an array.
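As a small illustration of creating a data frame from an array, the sketch below wraps a NumPy array in a Pandas data frame. The column names and numbers are made up for the example.

```python
import numpy as np
import pandas as pd

# A 3x2 array of made-up measurements (illustrative values only).
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Wrap the array in a data frame, giving each column a name.
df = pd.DataFrame(values, columns=["height", "width"])

# Columns can now be combined by name.
df["area"] = df["height"] * df["width"]

# Convert back to a plain NumPy array when needed.
back = df.to_numpy()
print(df)
print(back.shape)  # (3, 3)
```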

 

Section A: NumPy Arrays

NumPy is the Python extension package for multi-dimensional arrays. It is designed for scientific computation and is very efficient. Launch Spyder and then use the console to complete sections 1.3.1 to 1.3.3 at:


The same practical exercises can be found on pages 47 to 76 of the ‘Scipy Lecture Notes’ book, attached in PDF form in the ‘CourseResources’ section of Canvas.

 

You do not have to complete every step. If you understand how a feature works or is used, skip it and move to the next step.
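If you want a quick warm-up in the console before starting the exercises, the following sketch shows a few basic NumPy array operations. It is not taken from the lecture notes; the arrays are arbitrary examples.

```python
import numpy as np

# Create a 1D array and reshape it into 2 rows by 3 columns.
a = np.array([1, 2, 3, 4, 5, 6])
m = a.reshape(2, 3)

print(m.shape)   # (2, 3)
print(m[1, 2])   # 6  (row 1, column 2; indices start at 0)
print(m[:, 0])   # [1 4]  (first column)

# Arithmetic applies element by element, with no explicit loop.
doubled = m * 2

# Reduce along an axis: axis=0 sums down each column.
col_sums = m.sum(axis=0)

# Other common constructors.
zeros = np.zeros((2, 2))
steps = np.arange(0, 10, 2)
```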

 

Section B: The Pandas Data Frame

Some steps of the hands-on exercise from this section will be submitted for grading.

 

Data frames in Python are implemented in the Pandas extension package. A data frame contains named columns and can hold a different data type in each column, which is what we need because data samples have features of different data types.
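The sketch below illustrates the mixture of data types by column. The column names and values are hypothetical and are not taken from the lab's dataset.

```python
import pandas as pd

# Hypothetical samples; the names and values are made up for illustration.
samples = pd.DataFrame({
    "age_days": [12, 340, 7],            # integers
    "method": ["card", "cash", "card"],  # text
    "amount": [19.99, 5.00, 42.50],      # floating point
    "flagged": [False, False, True],     # booleans
})

# Each column reports its own data type.
print(samples.dtypes)

# Select only the numeric columns (integer and floating point).
numeric = samples.select_dtypes(include="number")
print(list(numeric.columns))
```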

 

The file payment_fraud.csv contains comma-separated data samples of payment transactions on various dates. The columns of this dataset are ‘accountAgeDays’, ‘numItems’, ‘localTime’, ‘paymentMethod’, ‘paymentMethodAgeDays’, and ‘label’. The last column, ‘label’, indicates whether a transaction is fraudulent (1) or not (0). This dataset is used in the ML classification example in chapter 2 of the course textbook (page 27).

 

We will use this dataset to learn various ways in which you can manipulate data in a data frame. You need to copy the file payment_fraud.csv into your current working directory or include the full path of the file in line 12. Use the Python code in the file DataFrameExample2.py to explore this data. You should run the code on each line individually, starting at line 7. Feel free to explore this dataset using Python in other ways. You should also open payment_fraud.csv in Excel and compare the result of each line of code with the contents of the Excel spreadsheet.
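As one possible way to explore the data in other ways, here is a minimal sketch. It does not reproduce DataFrameExample2.py. The rows below are made up and only mimic the column layout described earlier; they are not taken from payment_fraud.csv, and an in-memory string stands in for the file.

```python
import io
import pandas as pd

# Made-up rows that only mimic the column layout of the dataset;
# they are NOT real records from payment_fraud.csv.
csv_text = """accountAgeDays,numItems,localTime,paymentMethod,paymentMethodAgeDays,label
29,1,4.75,paypal,28.2,0
725,1,4.74,storecredit,0.0,0
1,5,4.92,creditcard,0.0,1
845,1,4.88,creditcard,0.0,0
"""

# With the real file you would pass its path instead,
# for example pd.read_csv("payment_fraud.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                    # number of rows and columns
print(df.head(2))                  # first two samples
print(df.dtypes)                   # data type of each column
print(df["label"].value_counts())  # how many samples carry each label

# Boolean filter on a numeric column: samples with more than one item.
multi_item = df[df["numItems"] > 1]

# Group by a column: the mean of 'label' is the share of fraudulent samples.
fraud_rate = df.groupby("paymentMethod")["label"].mean()
print(fraud_rate)
```

Comparing what `head` and `shape` print against the spreadsheet view is a quick check that the file loaded as expected.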

 

1. Copy and paste the output you get from lines 19, 22, 29, 47, 55, 68, and 73 into a Word document.

2. Provide code that returns all samples whose payment method is ‘paypal’.

3. Provide code that returns the fraudulent transaction samples whose payment method is ‘paypal’.

 

Submit the Word document containing your answers to 1, 2, and 3 to the dropbox for lab 2.

 

