Due: 11:59PM Eastern Time, July 13th, 2021
You can refer to the bigger picture roadmap, written by Ellie, here.
In this stage of the final project, you are expected to prepare the necessary data for your exploratory stage.
We provide one stencil file, data-deliverable-report.md. Your group is to fill out this report, as well as submit your code and data samples in this form. Specifically:
- Compress all the code (.py or .ipynb files) that you used to collect and process your data into one .zip file and upload it in the form. You will also need to provide us with a link to your collected and processed data.
- Submit data sample files (.csv, .tsv, or .db files) that are a subset (10-100 samples) of the big dataset that you will collect in this assignment. More details below.

Your team has been matched to a final project TA. The matching can be found here.
During the process of completing the final project, you can reach out to your final project TA to get feedback on your progress. You can ask them about anything (besides completing the final project for you 😛). Things that you can ask them include:
Please try to reach out to your TA in advance to schedule your meeting with them. When you reach out, concisely share as much information as possible so they can get to know your project and research it in depth before the meeting; that way they can be the most helpful to you. Last but not least, make sure to do your part and try your best before reaching out to them for help.
Along the way, if you have not reached out to them and it's getting close to a deadline, they might reach out to you to check whether you need any help or feedback. Feel free to use them as a resource!
Your group should create and work together on a GitHub repository. If you need any support creating a GitHub repository, please feel free to reach out to your HTA or your final project TA for help getting started.
There are many viable options for data storage for large files, but some options that we recommend are: Google Drive, Git LFS, or Dropbox. You have the ability to choose whichever file storage that you want - as long as you can provide us with a link to your data, and as long as the members in your team all have easy access to the data so you can smoothly do the project together. If you need help,
Note: In this section, we present a lot of things to consider when you perform tasks related to your data. In your write up or during the process of collecting/cleaning the data, you don’t have to worry too much about having to address all of the points we bring up - just focus on whatever is the most relevant to your project. For example, if you are dealing with a data source that has a ton of missing fields and/or duplicate values but everything else is perfect, it doesn’t make sense to put a lot of time and energy into writing up about how there is no problem with the data distribution or the data type. However, you should not miss out on things that are critical to holistically analyzing your dataset – that’s why we have provided you with a list of questions that you can think and write about when evaluating your dataset and your progress. Just focus on what matters – and we are here to help if you are confused about anything!
TODO:
Write all the code that you need to get your data in a manner consistent with the final project roadmap. Use the code to get the data.
In Step 1 of your data-deliverable-report.md, describe how you collected your data. Using an outline or pseudocode blocks (in English; no need to write any code at all), enumerate the steps that you took to collect your data.
An example:
> We used `[TECHNOLOGY]` to get this data.
We made requests to the `BIGData` API to get *n* data points,
and to the `BIGGERData` API to get *k* other data points.
From there, we do *X* and *Y* to get the data into the format
that we need for analysis.
Imagine that you are a complete stranger to this project, and ask yourself whether you would feel confident that you could replicate this data collection procedure based on your description for this question.
Please aim for a response to this question that is at most around 200 words.
Also in Step 1 of data-deliverable-report.md, answer: How reputable is the source of your data? Are there any other considerations you took into account when collecting your data? This is open-ended based on your data; feel free to leave this blank. (Example: If it’s user data, is it public/are they consenting to have their data used? Is the data potentially skewed in any direction?)
Please aim for a response to this question that is at most around 200 words.
Now, your goal is to get your data into one or a few files that you will use to perform your exploratory stage. This starts with designing what the format of your data will look like.
In the Step 2 section of data-deliverable-report.md, we want you to describe the format of your data. This includes:
You can refer to this or this example of a README for data (in .csv and .json formats, respectively) for some inspiration.
Please aim for a response to this question that is at most around 300 words.
TODO:
In the first part of Step 3 of data-deliverable-report.md, describe your observations of the data (that you will be working with in your analysis) prior to cleaning it. Below are some questions that you may want to answer pertaining to your data:
Please aim for a response to this question that is at most around 250 words.
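A few summary counts computed before cleaning can help ground these observations. The following is a minimal sketch using only the Python standard library. The function name `profile_rows`, the column names, and the inline sample are all hypothetical, and treating empty strings as missing values is an assumption that may not match your data.

```python
import csv
import io
from collections import Counter


def profile_rows(rows):
    """Summarize dict rows: row count, missing values per column, exact duplicates."""
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        for column, value in row.items():
            # Assumption: a missing value is None or an empty/whitespace-only string.
            if value is None or value.strip() == "":
                missing[column] += 1
        # Two rows count as duplicates only if every field matches exactly.
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Hypothetical sample standing in for a raw data file.
raw = "id,city,price\n1,Boston,10\n2,,12\n1,Boston,10\n3,Providence,\n"
rows = list(csv.DictReader(io.StringIO(raw)))
summary = profile_rows(rows)
print(summary)
```

Counts like these only cover missing and duplicate values; questions about data types and distributions still need a closer look at the individual columns.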
Clean your data based on your expectations and observations in Step 2 and Step 3, Part 1. Include all the code that you use to clean your data in the code .zip file that you will submit to the Google Form. Some things you might want to consider:
Make sure to keep track of the number of data points that you end up dropping, since you will have to write up a report in the third section!
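One way to keep track of dropped data points is to count them inside the cleaning code itself, by reason. The sketch below is only an illustration under assumptions: the function name `clean_rows`, the choice of required columns, and the sample rows are hypothetical, and your own cleaning rules will likely differ.

```python
def clean_rows(rows, required_columns):
    """Drop rows missing a required field, then exact duplicates, counting each kind of drop."""
    dropped = {"missing_required": 0, "duplicate": 0}
    seen = set()
    cleaned = []
    for row in rows:
        # Assumption: None and empty/whitespace-only strings both count as missing.
        if any((row.get(col) or "").strip() == "" for col in required_columns):
            dropped["missing_required"] += 1
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            dropped["duplicate"] += 1
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned, dropped


# Hypothetical rows standing in for your raw data.
rows = [
    {"id": "1", "city": "Boston", "price": "10"},
    {"id": "2", "city": "", "price": "12"},
    {"id": "1", "city": "Boston", "price": "10"},
    {"id": "3", "city": "Providence", "price": "15"},
]
cleaned, dropped = clean_rows(rows, required_columns=["city", "price"])
print(f"kept {len(cleaned)} of {len(rows)} rows; dropped: {dropped}")
```

Recording the counts per reason makes it easier to explain afterwards what was removed and how that may have changed the distribution of the data.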
In the second part of Step 3 of data-deliverable-report.md, describe your data cleaning process and outcome. You can include:
Please aim for a response to this question that is at most around 250 words.
TODO:
Construct a held-out test set that you will use to test your hypotheses or evaluate your models later on. The way you choose to do this will depend on the problem domain, so if you aren’t sure how to do it, don’t guess! Talk with your mentor TA or with Ellie early to get advice on what makes sense for your problem. Include all the code that you use to perform this in the code .zip file that you’ll submit.
In the Step 4 section of data-deliverable-report.md, be sure to include the following information:
Similar to what you did in Step 1, be sure to describe your train-test split in a manner accessible to a complete stranger – making it super easy for them to replicate the work that you have done in this step.
Please aim for a response to this question that is at most around 250 words.
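Fixing the random seed is one simple way to make a split easy for someone else to replicate. The following is a minimal sketch of a seeded random split using only the standard library. The function name `train_test_split`, the 20% test fraction, and the seed value are assumptions for illustration, and a plain random split may not suit every problem domain (for example, time-ordered or grouped data), so check with your TA first.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed and hold out a fraction as the test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: 50 integer IDs standing in for real rows.
rows = list(range(50))
train, test = train_test_split(rows, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Reporting the seed, the test fraction, and the resulting set sizes in your write-up gives a reader what they need to reproduce the same split.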
TODO: In the Step 5 section of data-deliverable-report.md, include your answer to the following (for all the points that are applicable):
TODO: In the Step 6 section of data-deliverable-report.md, include your answer to the following questions:
What are the major technical challenges that you anticipate with your project? This can be on anything - from getting more/better data, to building your models.
How have you been distributing the work between group members? What are your plans for work distribution in the future? You can write down the roles that each person has in the team. A quick note: Multiple members can and should certainly work on different parts of the project together!
Please generate some samples (10-100 rows) of the data that you will be using in the exploratory phase. If you are planning on using multiple data files in the exploratory phase, generate samples for each of the files that you will be using. Compress all the data sample files into one .zip file and submit it on the Google Form.
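For a .csv file, drawing a sample could look roughly like the sketch below. It uses only the standard library; the function name `sample_csv`, the sample size of 20, and the generated in-memory data are hypothetical. It also reads all rows into memory, which is an assumption that may not hold for very large files.

```python
import csv
import io
import random


def sample_csv(source, destination, n_samples=50, seed=0):
    """Copy the header plus up to n_samples randomly chosen rows from source to destination."""
    reader = csv.reader(source)
    header = next(reader)
    rows = list(reader)  # assumption: the full file fits in memory
    chosen = random.Random(seed).sample(rows, min(n_samples, len(rows)))
    writer = csv.writer(destination)
    writer.writerow(header)
    writer.writerows(chosen)
    return len(chosen)


# Hypothetical in-memory "big" dataset of 500 rows; real code would open files instead.
full = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(500)))
out = io.StringIO()
written = sample_csv(full, out, n_samples=20, seed=1)
sample_lines = out.getvalue().splitlines()
print(written, sample_lines[0])
```

With real files, `source` and `destination` would be file objects opened with `newline=""`, as the `csv` module documentation recommends.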
As mentioned above, please compress all the code that you used into one .zip file and also submit it on the Google Form.
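If you would rather build the archive from a script than by hand, the standard-library `zipfile` module can do it. This is a minimal sketch; the function name `zip_code_files`, the directory layout, and the file names are hypothetical and only serve to make the example self-contained.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_code_files(project_dir, archive_path, extensions=(".py", ".ipynb")):
    """Write every file under project_dir with a matching extension into one .zip archive."""
    project_dir = Path(project_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(project_dir.rglob("*")):
            if path.is_file() and path.suffix in extensions:
                # Store paths relative to the project root so the archive unpacks cleanly.
                arcname = path.relative_to(project_dir).as_posix()
                archive.write(path, arcname)
                added.append(arcname)
    return added


# Hypothetical project layout created in a temporary directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "scripts").mkdir(parents=True)
    (project / "scripts" / "collect.py").write_text("print('collect')\n")
    (project / "clean.ipynb").write_text("{}")
    (project / "notes.txt").write_text("not code\n")
    added = zip_code_files(project, Path(tmp) / "code.zip")
    with zipfile.ZipFile(Path(tmp) / "code.zip") as archive:
        names = sorted(archive.namelist())
print(names)
```

The same function could bundle the data samples by passing a different `extensions` tuple, such as `(".csv", ".tsv", ".db")`.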
Score range | Reasoning |
---|---|
90-100 | (1) The group has looked into/sanity checked the data that they retrieved. There should be evidence that the students understand the source and what the attributes are in the data: their types, their values, what the distribution of data is, what the outliers are, and how the different datasets can relate to each other. (2) There should be evidence that the students have looked into things that may need to be changed/removed from the dataset. If some data were removed, students should comment on what was removed and how that may impact the distribution of data. If a join results in many fewer rows than the input tables, they should comment on what was dropped and the impact. (3) The group’s construction of the train and test sets was meaningful, and they are aware of the quantity, quality, and distribution of samples in the dataset and in each of the train and test sets. (4) The group has critically thought about the socio-historical context and impact of using their data, and has a good distribution of work among members of the group. |
80-89 | Only three out of four criteria above were fully met. |
70-79 | Only two out of four criteria above were fully met. |
60-69 | Only one criterion was fully met. |
below 60 | None of the above criteria were fully met. |
Good luck! Please utilize the course staff - ask questions whenever you are confused, or whenever you want some feedback!