Statistical Techniques for Data
Analytics
Assignment
SQLite & dplyr in R
Introduction: SQLite is an open source, all inclusive
SQL-based database system in a single file.
Specifically, it does not require a separate server (i.e.
server-less), but instead the entire database
engine is integrated into an application that needs to
access a database. In addition, SQLite
packages the entire database into a single file, within
which the database layout and the actual
data held (in all the different tables and indexes) are
contained. As with all RDBMS, all interaction
with a SQLite based system is carried out through the SQL
language. In R, both the RSQLite and
sqldf packages make use of the integrated DataBase Interface
to access the constructed system1.
The dplyr package developed by RStudio is an R-based package
that is designed to provide a
highly optimised set of routines specifically for dealing
with data frames. The latter is a
particularly important data structure in statistics and in
R2, where several RDBMS such as SQLite
described above also implement such a structure for data
manipulations.
This assignment is
divided into two parts - Parts I and II. Part I concerns the use of
SQLite and dplyr on a
dataset available at,
https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/censusincome.data.gz
and perform a number
of tasks as specified in the next section (under Tasks). In part II, you are
required to discuss
in a technical report with approximately 2000 words (figures, tables and
appendix excluded)
which compares and evaluates the use of the two packages based on your
work in Part I.
Part I Tasks (60%)
Download the Census
Income data set from the above link and unzip/extract the data file onto a
directory in your own
filesystem.
1. Create a SQIite
database called census_income in R and a table named Income defined with
appropriate column
(attribute) names and data types as provided in the Appendix of this
document.
2. Add a column with
the name SS_ID to the Income table. Fill this column with consecutive
numbers starting from
1 for the first row. Make the SS_ID attribute the primary key of the
Income table.
3. Construct SQL
queries that provide the total number of males and females for each race
group reported in the
data. The result should show for example how many white females,
white males, black
males etc. are included into the dataset.
Get Free Quote!
325 Experts Online