Statistical Techniques for Data Analytics SQLite is an open source, all inclusive SQL-based database system in a single file.

computer science

Description

Statistical Techniques for Data Analytics

 Assignment

SQLite & dplyr in R

Introduction: SQLite is an open source, all inclusive SQL-based database system in a single file.

Specifically, it does not require a separate server (i.e. server-less), but instead the entire database

engine is integrated into an application that needs to access a database. In addition, SQLite

packages the entire database into a single file, within which the database layout and the actual

data held (in all the different tables and indexes) are contained. As with all RDBMS, all interaction

with a SQLite based system is carried out through the SQL language. In R, both the RSQLite and

sqldf packages make use of the integrated DataBase Interface to access the constructed system1.

The dplyr package developed by RStudio is an R-based package that is designed to provide a

highly optimised set of routines specifically for dealing with data frames. The latter is a

particularly important data structure in statistics and in R2, where several RDBMS such as SQLite

described above also implement such a structure for data manipulations.

This assignment is divided into two parts - Parts I and II. Part I concerns the use of

SQLite and dplyr on a dataset available at,

https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/censusincome.data.gz

and perform a number of tasks as specified in the next section (under Tasks). In part II, you are

required to discuss in a technical report with approximately 2000 words (figures, tables and

appendix excluded) which compares and evaluates the use of the two packages based on your

work in Part I.

Part I Tasks (60%)

Download the Census Income data set from the above link and unzip/extract the data file onto a

directory in your own filesystem.

1. Create a SQIite database called census_income in R and a table named Income defined with

appropriate column (attribute) names and data types as provided in the Appendix of this

document.

2. Add a column with the name SS_ID to the Income table. Fill this column with consecutive

numbers starting from 1 for the first row. Make the SS_ID attribute the primary key of the

Income table.

3. Construct SQL queries that provide the total number of males and females for each race

group reported in the data. The result should show for example how many white females,

white males, black males etc. are included into the dataset.


Related Questions in computer science category


Disclaimer
The ready solutions purchased from Library are already used solutions. Please do not submit them directly as it may lead to plagiarism. Once paid, the solution file download link will be sent to your provided email. Please either use them for learning purpose or re-write them in your own language. In case if you haven't get the email, do let us know via chat support.