Data Deliverable - Summer 2021 Final Project

Due: 11:59PM Eastern Time, July 13th, 2021

Overview

You can refer to the bigger picture roadmap, written by Ellie, here.

In this stage of the final project, you are expected to prepare the necessary data for your exploratory stage.

Logistics

Stencil & Handin

We provide one stencil file, data-deliverable-report.md. Your group is to fill out this report, as well as submit your code and data samples in this form. Specifically:

Your final project TA

Your team has been matched to a final project TA. The matching can be found here.

During the process of completing the final project, you can reach out to your final project TA to get feedback on your progress. You can ask them about anything (besides completing the final project for you 😛). Things that you can ask them include:

Please reach out to your TA in advance to schedule your meetings with them. When you reach out, concisely share as much information as possible so that they can get to know your project and research it in depth before meeting with you - that way they can be as helpful as possible. Last but not least - make sure to do your part and try your best before reaching out to them for help.

Along the way, if you have not reached out to them and it's getting close to a deadline, they might reach out to you just to check whether you need any help or feedback – feel free to utilize them as a resource!

Github Repository

Your group should create and work together on a GitHub repository. If you need any support getting started with one, please feel free to reach out to your HTA or your final project TA.

Data Storage

There are many viable options for storing large data files, but some options that we recommend are Google Drive, Git LFS, or Dropbox. You can choose whichever file storage you want - as long as you can provide us with a link to your data, and as long as all members of your team have easy access to the data so you can smoothly do the project together. If you need help, feel free to reach out to your final project TA.

Data Collection, Cleaning, and Processing

Note: In this section, we present many things to consider when you perform tasks related to your data. In your write-up and while collecting/cleaning the data, you don't have to address every point we bring up - just focus on whatever is most relevant to your project. For example, if you are dealing with a data source that has a ton of missing fields and/or duplicate values but everything else is fine, it doesn't make sense to put a lot of time and energy into writing about how there is no problem with the data distribution or the data types. However, you should not skip things that are critical to holistically analyzing your dataset – that's why we have provided you with a list of questions that you can think and write about when evaluating your dataset and your progress. Just focus on what matters – and we are here to help if you are confused about anything!


Step 1. Construct your dataset

TODO:

  1. Write all the code that you need to get your data, in a manner consistent with the final project roadmap. Use the code to get the data. (A minimal collection sketch is included after this list for reference.)

  2. In Step 1 of your data-deliverable-report.md, describe how you collected your data. In outline/pseudocode form (in English; no need to write any actual code), enumerate the steps that you took to collect your data.

    An example:

     > We used `[TECHNOLOGY]` to get this data.
     > We made requests to the `BIGData` API to get *n* data points,
     > and to the `BIGGERData` API to get *k* other data points.
     > From there, we do *X* and *Y* to get the data into the format
     > that we need for analysis.
    

    Imagine that you are a complete stranger to this project - ask yourself whether you would feel confident that you could replicate this data collection procedure based on your description for this question.

    Please aim for a response to this question that is at most around 200 words.

  3. Also in Step 1 of data-deliverable-report.md, answer: How reputable is the source of your data? Are there any other considerations you took into account when collecting your data? This is open-ended based on your data; feel free to leave this blank. (Example: If it’s user data, is it public/are they consenting to have their data used? Is the data potentially skewed in any direction?)

    Please aim for a response to this question that is at most around 200 words.
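For reference, here is a minimal sketch of what an API-based collection script might look like. The endpoint URL, parameter names, and output filename are all hypothetical placeholders (and the `requests` library is an assumed dependency); adapt everything to whatever source your project actually uses.

```python
import csv

import requests  # assumed third-party dependency (pip install requests)

# Hypothetical endpoint and parameter names -- replace with your real data source.
API_URL = "https://api.example.com/v1/records"
PAGE_SIZE = 100


def fetch_all_records(api_url, page_size=PAGE_SIZE):
    """Page through the API until an empty page is returned.

    Assumes each page is a JSON list of flat record dicts.
    """
    records, page = [], 1
    while True:
        resp = requests.get(
            api_url, params={"page": page, "per_page": page_size}, timeout=30
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records


if __name__ == "__main__":
    rows = fetch_all_records(API_URL)
    if rows:
        # Persist the raw pull so the cleaning step always starts from the same snapshot.
        with open("raw_data.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
```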

Step 2: Define your data schema

Now, your goal is to get your data into a file (or a few files) that you will use to perform your exploratory stage. This starts with designing what the format of your data will look like.

In the Step 2 section of data-deliverable-report.md, we want you to describe the format of your data. This includes:

You can refer to this or this example README for data (in .csv and .json formats, respectively) for some inspiration.

Please aim for a response to this question that is at most around 300 words.
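If it helps to make your schema concrete, a small loading/validation sketch like the one below can double as documentation and as a sanity check when you load the data. It assumes pandas, and the column names and types here are hypothetical placeholders; substitute the fields from your own schema.

```python
import pandas as pd

# Hypothetical schema (column name -> expected type); substitute your own fields.
SCHEMA = {
    "user_id": "int64",
    "created_at": "datetime",
    "rating": "float64",
    "review_text": "string",
}


def load_with_schema(path):
    """Load the CSV, check that the documented columns exist, and coerce types."""
    df = pd.read_csv(path)
    missing = set(SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    for col, dtype in SCHEMA.items():
        if dtype == "datetime":
            df[col] = pd.to_datetime(df[col], errors="coerce")
        else:
            df[col] = df[col].astype(dtype)
    return df[list(SCHEMA)]
```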

Step 3: Clean & Process your data

TODO:

  1. In the first part of Step 3 of data-deliverable-report.md, describe your observations of the data prior to cleaning (the data that you will be working with in your analysis). Below are some questions that you may want to answer about your data (a combined inspection-and-cleaning sketch appears after this list):

    • How many data points are there total? How many are there in each group you care about (e.g. if you are dividing your data into positive/negative examples, are they split evenly)?
    • Are there missing values? Do these occur in fields that are important for your project’s goals?
    • Are there duplicates? Do these occur in fields that are important for your project’s goals?
    • How is the data distributed? Is it uniform or skewed? Are there outliers? What are the min/max values? (focus on the fields that are most relevant to your project goals)
    • Are there any data type issues (e.g. words in fields that were supposed to be numeric)? Where are these coming from? (E.g. a bug in your scraper? User input?) How will you fix them?
    • Are you joining some tables together? How many rows from each table will be joined together, and on which attributes?

    Please aim for a response to this question that is at most around 250 words.

  2. Clean your data based on your expectations and observations in Step 2 and Step 3, Part 1. Include all the code that you use to clean your data in the code .zip file that you will submit to the Google Form. Some things you might want to consider (see the sketch after this list):

    • Making sure that your data features are of the type that you expect
    • Dropping rows with duplicates or missing values
    • Dropping outliers
    • Dropping rows with unfixable data type issues that might interfere with your processing later

    Make sure to keep track of the number of data points that you end up dropping, since you will have to report on this in the next part!

  3. In the second part of Step 3 of data-deliverable-report.md, describe your data cleaning process and outcome. You can include:

    • For each step that you took to clean and process the data: what is the step? How many samples are you left with after each step? What are some examples of the samples that are being removed? Why did you make this data cleaning decision?
    • At the end, how many samples are you left with? Do you think this is enough data to perform your analysis later on? Are the samples that you’re left with representative, or is it likely to exhibit some sort of sampling bias?
    • Was the data that was lost important/noteworthy in any way (i.e., did it affect the distribution of your data meaningfully)?

    Please aim for a response to this question that is at most around 250 words.
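As a reference for parts 1 and 2, here is a minimal pandas sketch of the kind of inspection and cleaning described above. The filenames and column names (`rating`, `review_text`) are hypothetical; swap in the fields that matter for your project, and keep the counts at each step so you can report them.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # placeholder filename

# --- Part 1: inspect before cleaning ---
print("total rows:", len(df))
print("missing values per column:\n", df.isna().sum())
print("duplicate rows:", df.duplicated().sum())
print(df.describe(include="all"))  # distributions, min/max, etc.

# --- Part 2: clean, tracking how many rows each step removes ---
counts = {"start": len(df)}

df = df.drop_duplicates()
counts["after_drop_duplicates"] = len(df)

df = df.dropna(subset=["rating", "review_text"])  # only fields critical to the analysis
counts["after_dropna"] = len(df)

# Coerce a numeric field; rows that can't be parsed become NaN and are dropped.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df = df.dropna(subset=["rating"])
counts["after_type_fix"] = len(df)

# Drop extreme outliers, e.g. ratings outside a hypothetical valid 1-5 range.
df = df[df["rating"].between(1, 5)]
counts["after_outlier_filter"] = len(df)

print(counts)
df.to_csv("clean_data.csv", index=False)
```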

Step 4: Define a held out test set

TODO:

  1. Construct a held-out test set that you will use to test your hypotheses or evaluate your models later on. The way you choose to do this will depend on the problem domain, so if you aren't sure how to do it, don't guess! Talk with your mentor TA or with Ellie early to get advice on what makes sense for your problem (a minimal random-split sketch follows this list). Include all the code that you use to perform this in the code .zip file that you'll submit.

  2. In the Step 4 section of data-deliverable-report.md, be sure to include the following information:

    • What is the size of your training dataset? What is the size of your test set?
    • How did you choose the test set? Did you randomly sample from the entire dataset, or did you do something more intricate to select the test samples? Why did you make this choice?

    Similar to what you did in Step 1, be sure to describe your train-test split in a manner accessible to a complete stranger – making it super easy for them to replicate the work that you have done in this step.

    Please aim for a response to this question that is at most around 250 words.
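As one possible starting point, here is a minimal sketch of a simple random split using scikit-learn (an assumed dependency), reading from a placeholder filename. It assumes a plain random split is appropriate for your problem.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("clean_data.csv")  # placeholder filename

# Hold out 20% of the rows; fixing the seed makes the split reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"train: {len(train_df)} rows, test: {len(test_df)} rows")
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```

If your groups are imbalanced, the `stratify=` argument of `train_test_split` preserves group proportions across the two sets; time-ordered data may instead call for a chronological split.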

Step 5: Socio-historical Context & Impact Report

TODO: In the Step 5 section of data-deliverable-report.md, include your answer to the following (for all the points that are applicable):

  1. Who are the major stakeholders of your project? Think about the people that are represented in your dataset, and/or the people whom your project will impact by using this dataset - we want specific mentions of particular demographic groups.
  2. How could an individual or particular community’s privacy be affected by the aggregation or analysis of your data? What are the ethical problems that you could see arising in the course of doing the project?
  3. What kind of underlying historical or societal biases might your data contain? How can this bias be mitigated? Think about your choices about which data sources to use, how to deal with missing data, etc.
  4. If applicable, what is the socio-historical impact of previous works that have been done using your data/on your (expected) topic? How is the anticipated impact of your project similar/different?

Step 6: Team Report

TODO: In the Step 6 section of data-deliverable-report.md, include your answer to the following questions:

  1. What are the major technical challenges that you anticipate with your project? This can be on anything - from getting more/better data, to building your models.

  2. How have you been distributing the work between group members? What are your plans for work distribution in the future? You can write down the roles that each person has in the team. A quick note: Multiple members can and should certainly work on different parts of the project together!

Wrapping Up

Generating samples of your data

Please generate some samples (10-100 rows) of the data that you will be using in the exploratory phase (a one-line sampling sketch is shown below). If you are planning on using multiple data files in the exploratory phase, generate samples for each of the files that you will be using. Compress all the data sample files into one .zip file and submit it on the Google Form.
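A quick way to produce such a sample with pandas (the filename here is just a placeholder for wherever your cleaned data lives):

```python
import pandas as pd

df = pd.read_csv("clean_data.csv")  # placeholder filename
# A reproducible random sample of up to 50 rows for the submission .zip.
df.sample(n=min(50, len(df)), random_state=0).to_csv("data_sample.csv", index=False)
```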

Compressing all the code that you used

As mentioned above, please compress all the code that you used in one .zip file and also submit it on the Google Form.

Rubric

| Score range | Reasoning |
| --- | --- |
| 90-100 | (1) The group has looked into/sanity checked the data that they retrieved. There should be evidence that the students understand the source and what the attributes are in the data - their types, their values, what the distribution of data is, what the outliers are, and how the different datasets can relate to each other. (2) There should be evidence that the students have looked into things that may need to be changed/removed from the dataset. If some data were removed, students should comment on what was removed and how that may impact the distribution of data. If a join results in many fewer rows than the input tables, they should comment on what was dropped and the impact. (3) The group's construction of the train and test sets was meaningful, and they are aware of the quantity, quality and distribution of samples in the dataset and in each of the train/test sets. (4) The group has critically thought about the socio-historical context and impact of using their data, and has a good distribution of work among members of the group. |
| 80-89 | Only three out of four criteria above were fully met. |
| 70-79 | Only two out of four criteria above were fully met. |
| 60-69 | Only one criterion above was fully met. |
| below 60 | None of the above criteria were fully met. |

Last words

Good luck! Please utilize the course staff - ask questions whenever you are confused, or whenever you want some feedback!