Assignment 1 « SQL »

Due: 11:59pm 06/01/2021 ET

SQL Joke

Setting Up

Getting the Stencil

You can click here to get the stencil code for Homework 1. Reference this guide for more information about Github and Github Classroom.

The data is located in the data folder. To ensure compatibility with the autograder, you should not modify the stencil unless instructed otherwise. For this assignment, please write each of your queries to its corresponding SQL file. Failing to do so may hinder with the autograder and result in a low grade.

SQLite

If you are working locally, you can check if SQLite is installed already by running sqlite3 -version in your terminal. You can refer to this guide or this guide to install SQLite and learn how to use SQLite on your machine. SQLite is installed on all department machines and by default on Mac. It can be accessed from the command line using sqlite3.

Running sqlite3 somedb.db from your terminal will launch an environment that will allow you to type your SQL queries directly into the terminal. You can exit this environment by pushing Ctrl+D or by typing .exit and pressing enter.

As a more explicit example, to open a sql environment where you can query the movies.db database, you can type:

$ sqlite3 movies.db

To execute a SQL statement that you have saved in a solution file, you can run the following command:

$ sqlite3 movies.db < sql_solutions.sql

For more information on using SQLite from the command line, see http://www.sqlite.org/sqlite.html. Additionally, we have provided very helpful hints for most of the problems; you should be able to use these as a starting point if you get stuck before looking up additional information online.

Additionally or alternatively, there are really powerful SQLite tools out there that you can use, especially if you prefer a more graphical way of interacting with SQLite. Check out the section below for more details.

Database

Some useful tools you can use to view the content in a database: SQLite Viewer (a web application) and SQLTools + SQLite packages (if you are developing on VS Code). Additionally, you can install and use SQLite extensions on your internet browser (e.g., SQLite Manager for Chrome or for Mozilla)

Part 1: Starting Off!

20 points

This part of the assignment builds off the exercises you completed in the lab. If you have not yet completed the lab, please do so before starting this assignment. There are some really useful hints and examples you can borrow from the lab for this assignment. The database and schema are described again below, but are the same from the lab.

We have provided a database named people.db with the name, age, ID, and occupation of some Brown students and alumni. Here is the schema:

people_main(ID INTEGER, name TEXT, occupation TEXT, age INTEGER)
people_likes(ID1 INTEGER, ID2 INTEGER)
people_friends(ID1 INTEGER, ID2 INTEGER)

In the people_main table, ID is a unique identifier for a particular student or alumni. name, occupation and age correspond to the person's first name, occupation and age.

In the people_friends table, each (ID1, ID2) pair indicates that the particular person with ID1 is friends with the person with ID2 (and vice versa). The friendship is mutual, and if (ID1, ID2) is in the table, it is guaranteed that (ID2, ID1) exists in the table.

In the people_likes table, each (ID1, ID2) pair indicates that the student or alumni with ID1 likes the person with ID2. The (ID1, ID2) pair in the table does not guarantee that the (ID2, ID1) pair also exists in the table.

Your job is to write SQL queries for the data being requested:

(4 points) Write a SQL statement that returns the name and number of people that like each person. Results should be ordered by count (descending), and then by name (A-Z). Save the query to part1_problem1.sql
Hint: Use a LEFT JOIN! The following website is quite useful: http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
(4 points) Write a SQL statement that returns the two occupations that have the lowest count of popular people. Consider popular people as people who are liked by at least one other person. Return the two occupations and the counts. Results should be ordered by occupation (A-Z). Save the query to part1_problem2.sql
Hint: The LIMIT statement will come in handy!
(4 points) Write a SQL statement that returns the name and occupation of all people who have more than 3 friends. Results should be ordered by name (A-Z). Save the query to part1_problem3.sql
Hint: You'll need to take a look at the HAVING function.
(4 points) Write a SQL statement that returns the distinct name and age of all people who are liked by anyone younger than them. Results should be ordered by name (A-Z). Save the query to part1_problem4.sql.
(4 points) Write a SQL statement to find pairs (A, B) such that person A likes person B, but A is not friends with B. The query should return 4 columns: ID of person 1, name of person 1, ID of person 2 and name of person 2. Results should be ordered by ID1 (ascending), then ID2 (ascending). Save the query to part1_problem5.sql

Time to join stuff!

SQL Joke

Part 2: Getting Harder!

60 points

For this part of the assignment, you will be using the TMDB Movie Dataset, which has been exported to the movies.db database. The database schema is as follows:

movies(budget INTEGER, homepage TEXT, id INTEGER, original_language TEXT, original_title TEXT, overview TEXT, popularity REAL, release_date TEXT, revenue REAL, runtime INTEGER, status TEXT, tagline TEXT, title TEXT, vote_average REAL, vote_count INTEGER)
scores(review TEXT, min_score INTEGER, max_score INTEGER)

We encourage you to use the WITH operator, which lets you divide your query into separate queries. As an example, we can define a subquery and use it in another query as follows (there is also an example in the lab!):

WITH subquery AS (
SELECT
  original_title, vote_average
FROM
  movies
)

SELECT
  original_title
FROM
  subquery
);

(10 points) Write a SQL query to find the original_title, budget and release_date of the movie "John Carter" and append to that the movie original_title, budget and release_date of the movie that was released 9 days after "John Carter".
You can add days to a particular day by using the date function. For example, in order to add 3 days to to '2012-07-16', you can use date('2012-07-16', '+3 days')

Hint: The UNION statement should come in handy.

(10 points) Write a SQL query to count the number of movies that start with "The", end with a "2" or contain the word "shark". Your query should be case insensitive and return one column with one entry. You should return a single value.

Hint: You may want to look into CASE statements and the LIKE operator. (Lab!)

(10 points) Write a SQL query to select the original_title of all movies and a column where there is a 1 if there exists another movie that has the same vote average and the same runtime as that movie, and a 0 otherwise. Results should be ordered by original_title (A-Z).

Hint: You may want to look into the EXISTS operator. Additionally, think about possible edge cases.

(10 points) Write a SQL query that returns the original_title, vote_average and review of every movie. The reviews depends on the vote_average as described described in the scores table. For example, movies with a vote average between 2 and 3.9 (inclusive) are reviewed as 'poor', whereas movies with a vote average between 9 and 10 (inclusive) are reviewed as 'excellent'. If a movie is reviewed as 'awful' or 'poor' then original_title should read 'do not watch'. Results should be ordered by id (ascending). For example, the output should have the following format:
```
'Snow White'   | 8.7 | 'great'
'Toy Story'    | 9.3 | 'must see'
'do not watch' | 2.3 | 'poor'
```
Hint: Look into the BETWEEN statement and how it can be used in a join.

Another Hint: Do not modify the current database by using UPDATE. Take a look at the CASE operation example from the lab.
(10 points) Write a SQL query that finds the original_title, release_date and revenue of all the movies whose revenue exceeded the average revenue of all the movies released on the same day (including itself). Results should be ordered by release_date (ascending), and then revenue (descending).

(10 points) Write a SQL query that, for each original_language that has more than 2 movies , finds the number of movies that were reviewed as 'poor' and the number of movies that were reviewed as 'good'.

Like in the 4th question, you will need to look at the scores table to see what the review categories are and use the vote_average field of a movie to determine which review category it falls under. Your query should return 3 columns (original_language, num_poor which is the number of 'poor' movies for that language, and num_good which should be the number of 'good' movies for the language). Your results should be ordered by number of 'good' movies (descending) and then number of 'poor' movies (ascending). Remember to only include languages that have more than 2 movies!

Hint: Refer to the examples from the lab!

Part 3: Optimization

10 points

We have provided you with the athletes.db database, although querying it is not necessary at all. The schema is as follows:

school_athletes(ID INTEGER, name TEXT, school TEXT, performance_score INTEGER, division TEXT)

For the query below, explain why the given query might not be the most efficient way to accomplish the task. Write out an optimized version of the query in writeup.txt. Explain what steps you took to optimize it and why your version would be more efficient.

(6 points) The SQL query to optimize is as follows:

SELECT ID, name
      FROM school_athletes AS athletes
      WHERE school = 'Brown' and performance_score > (
        SELECT AVG(performance_score)
        FROM school_athletes
        WHERE division = athletes.division
      );

(4 points) Consider two tables. Table A is very long with 1 billion rows and 5 columns. Table B is very wide with 1000 rows and 10,000 columns. If we were to join the two tables and want to make sure the join is performant, how should we best filter the tables? Assume that we can select from each table and then join the results. Specifically, state the table in which we should use WHERE heavily and the table in which we should be careful about what values we use in our SELECT.

Part 4: Datasets & Society

10 points

CS1951a wants to help you identify and grapple with societal consequences at each stage of the data lifecycle. The design, construction, and socio-historical context of a dataset have social implications that data scientists must consider when creating and using it.

The goal of this section is to:

Examine how the design of database schemas can embed social values and/or inflict harm
Weigh the technical challenges of capturing real-world complexity in data representation
Highlight and evaluate a proposed process for database creators that raises a wide range of of social concerns.

Task:

Read When Binary Code Won’t Accommodate Nonbinary People
Historically, many databases have stored gender and sex as binary variables (i.e. booleans) to minimize memory usage. (While SQLite3 does not actually have a boolean type, many larger-scale SQL databases do; a small INTEGER is often used instead to represent categorical data). .
Answer the questions below in writeup.txt. Each response should be thoughtful, provide justification for your claims, and be concise but complete. See the response guide for more guidance.

Questions

(8 points) Given this database schema:
```
people(ID INTEGER, name TEXT, school TEXT, age INTEGER, race INTEGER, gender INTEGER)
```
1. What are some identities that might not be represented in this database design?
2. How would you modify this schema to represent non-binary genders, or complex and historically erased race identities?
  
  Outline (1) what data types you might use, (2) the possible range of values for your data, (3) the pseudocode for the steps that you'll need to take to update the existing records in the database, and (4) the pseudocode for what - if anything - you would do with pre-existing records, and (5) any data processing you might have to perform on new entires.
3. State at least two pros and cons of your design.
(4 points) Consider a system that allows people to freely write in whatever they identify with (in terms of gender identification or race identification).
1. What would be the pros and cons of this system?
2. What would the data type be of race and gender attributes, and what can be taken in as valid entries?
3. What are some pros and cons of the current design that only allows people to choose from predefined categories?

Read Section 1 (Introduction and Objectives), Section 3 (Questions and Workflow), and Section 4 (Impact and Challenges) of the paper Datasheets for Datasets (Gebru et al., 220).

(4 points) Which datasheet questions stood out to you? Pick three questions that are in different stages of the dataset lifecycle. For each question, provide a specific example real or hypothetical) of a negative societal consequence that the question could help a dataset user or creator to identify and/or prevent. If you find a real example online, provide a link to cite your source.
- For example: The questions “What mechanisms or procedures were used to collect the data? How were these mechanisms or procedures validated?” could have helped identify and prevent errors in automatic COVID-19 data collection in England. Public Health England used an outdated Excel file format in their automatic process to pull COVID-19 data together, leading nearly 16,000 coronavirus cases to go unreported. If the dataset creator had thought of procedures to validate this process ahead of time, they could have identified the loss of data earlier and prevented the threat to public health efforts. (Source: https://www.bbc.com/news/technology-54423988)
(4 points) Identify a social, ethical, or political issue that is not addressed by the proposed datasheet system. Propose a modification or addition to the datasheet system that helps address this issue. Your proposal could involve new sections, questions, external infrastructure, incentives, and more! Explain how your proposal would help address the issue you identified.

Additional Information

Timnit Gebru, the creator of Datasheets for Datasets, is a leader in the field of AI ethics and an advocate for marginalized and underrepresented communities in tech. On December 2nd, Google fired her from her role as co-lead of the company's Ethical Artificial Intelligence team. Gebru said she was fired for sending an email criticizing Google's treatment of marginalized employees. Google's treatment of Gebru has been widely criticized by those within and outside of the tech community, and has sparked discussions about racism in the tech industry, industry censorship of critical research, and corporate diversity efforts. Check out these articles for additional information:

Handing In

After finishing the assignment (and any assignment in the future), run python3 zip_assignment.py in the command line from your assignment directory, and fix any issues brought up by the script.

After the script has been run successfully, you should find the file sql-submission-1951A.zip in your assignment directory. Please submit this zip file on Gradescope under the respective assignment. (If you have not signed up for Gradescope already, please refer to this guide.)

Credits

Made with ♥ by Jens and Esteban, updated for Spring 2021 by Yuchen, Sunny, Nam and Nazem, then again for Summer 2021 by Nam and Neal.

The socially responsible computing component was designed by Lena and Gaurav, updated in Summer 2021 by Evan.

Movie Database from: https://www.kaggle.com/tmdb/tmdb-movie-metadata