[Data Science] Capital One data scientist intern code challenge

Posted on 5-2-2016 11:37 PM

--------------------------------------------------------------------------------
|         Code Test Part 1: Model building on a synthetic dataset              |
--------------------------------------------------------------------------------

We have provided two tab-delimited files along with these instructions:

    - codetest_train.txt: 5,000 records x 254 features + 1 target (~7.8MB)
    - codetest_test.txt : 1,000 records x 254 features            (~1.5MB)

These two synthetic datasets were generated using the same underlying data
model. Your goal is to build a predictive model using the data in the training
dataset to predict the withheld target values from the test set.

You may use any tools available to you for this task. Ultimately, we will
assess predictive accuracy on the test set using the mean squared error metric.
You should return to us the following:

    - A 1,000 x 1 text file containing 1 prediction per line for each record
        in the test dataset.

    - A brief writeup describing the techniques you used to generate the
        predictions. Details such as important features and your estimates of
        predictive performance are helpful here, though not strictly
        necessary.

    - (Optional) An implementable version of your model. What this would look
        like largely depends on the methods you used, but could include things
        like source code, a pickled Python object, a PMML file, etc. Please
        do not include any compiled executables. If you choose not to submit
        this, please ensure your modeling methods are adequately described
        in the writeup.
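
A minimal sketch of how the deliverables above might be supported: computing the mean squared error metric, estimating predictive performance with k-fold cross-validation, and writing one prediction per line. This is not part of the original instructions. The function names, the `fit`/`predict` callables, and the mean-only baseline are all hypothetical placeholders; any real model would be plugged in where the baseline is.

```python
import random
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_mse(X, y, fit, predict, k=5, seed=0):
    """Estimate out-of-sample MSE by k-fold cross-validation.

    fit(X_train, y_train) returns a model object;
    predict(model, X_test) returns a list of numeric predictions.
    Both callables are assumptions of this sketch.
    """
    if k < 2 or k > len(y):
        raise ValueError("k must be between 2 and the number of records")
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in held_out])
        scores.append(mean_squared_error([y[i] for i in held_out], preds))
    return mean(scores)


def fit_mean_baseline(X_train, y_train):
    """Hypothetical baseline: ignore the features, remember the target mean."""
    return mean(y_train)


def predict_mean_baseline(model, X_test):
    """Predict the remembered mean for every test record."""
    return [model] * len(X_test)


def write_predictions(path, predictions):
    """Write one prediction per line, as the instructions request."""
    with open(path, "w") as handle:
        for value in predictions:
            handle.write(f"{value}\n")
```

A cross-validated MSE from the baseline gives a floor that any real model should beat, and the same `k_fold_mse` call can be reused to report the performance estimate in the writeup.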


--------------------------------------------------------------------------------
|                       Code Test Part 2: Baby Names!                          |
--------------------------------------------------------------------------------

In this section, you will acquire and analyze a real dataset on baby name
popularity provided by the Social Security Administration. To warm up, we will
ask you a few simple questions that can be answered by inspecting the data.

A) Descriptive analysis

The data can be downloaded in zip format from:
http://www.ssa.gov/oact/babynames/state/namesbystate.zip
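
As a starting point for inspecting the files, here is a rough sketch of reading every record out of the downloaded archive. It is not part of the original instructions. It assumes the archive is already saved locally, that the data sits in per-state members ending in `.TXT`, and that each row has five comma-separated fields (state, sex, year, name, count) with no header row; check these assumptions against the actual files. `NameRecord` and `read_records` are hypothetical names.

```python
import csv
import io
import zipfile
from typing import Iterator, NamedTuple


class NameRecord(NamedTuple):
    state: str
    sex: str
    year: int
    name: str
    count: int


def read_records(zip_source) -> Iterator[NameRecord]:
    """Yield one NameRecord per data row from every .TXT member.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.upper().endswith(".TXT"):
                continue  # skip any readme or other non-data member
            with archive.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.reader(text):
                    if len(row) != 5:
                        continue  # assumption: data rows have five fields
                    state, sex, year, name, count = row
                    yield NameRecord(state, sex, int(year), name, int(count))
```

Something like `records = list(read_records("namesbystate.zip"))` would then load everything into memory, which is a reasonable first step before answering the questions below.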

1.  Please describe the format of the data files. Can you identify any
    limitations or distortions of the data?
2.  What is the most popular name of all time? (Of either gender.)
3.  What is the most gender-ambiguous name in 2013? In 1945?
4.  Of the names represented in the data, find the name that has had the largest
    percentage increase in popularity since 1980. Largest decrease?
5.  Can you identify names that may have had an even larger increase or decrease
    in popularity?
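
One possible way to approach questions 2 through 4, sketched under explicit assumptions and not taken from the original instructions. Records are assumed to be `(state, sex, year, name, count)` tuples. "Popularity" is taken to mean raw recorded counts, "gender ambiguous" is taken to mean a female share closest to 50%, and the percentage change is computed only for names present in both years. Each of these definitions is a judgment call that a writeup would need to justify, and all function names are hypothetical.

```python
from collections import defaultdict


def national_totals(records):
    """Sum counts across states: {(year, sex, name): count}."""
    totals = defaultdict(int)
    for state, sex, year, name, count in records:
        totals[(year, sex, name)] += count
    return totals


def most_popular_name(totals):
    """Name with the largest total count over all years and both sexes."""
    by_name = defaultdict(int)
    for (year, sex, name), count in totals.items():
        by_name[name] += count
    return max(by_name, key=by_name.get)


def most_ambiguous_name(totals, year, min_total=0):
    """Name whose female share in `year` is closest to 50%.

    Ties are broken in favor of the more common name. Returns None when
    no name is recorded for both sexes in that year.
    """
    female = defaultdict(int)
    male = defaultdict(int)
    for (y, sex, name), count in totals.items():
        if y != year:
            continue
        (female if sex == "F" else male)[name] += count
    shared = [n for n in female
              if n in male and female[n] + male[n] >= min_total]
    if not shared:
        return None

    def distance_from_even(n):
        total = female[n] + male[n]
        return (abs(female[n] / total - 0.5), -total)

    return min(shared, key=distance_from_even)


def percent_changes(totals, start_year, end_year):
    """Percentage change in count for names present in both years."""
    start = defaultdict(int)
    end = defaultdict(int)
    for (year, sex, name), count in totals.items():
        if year == start_year:
            start[name] += count
        elif year == end_year:
            end[name] += count
    return {name: 100.0 * (end[name] - start[name]) / start[name]
            for name in start if name in end}
```

Names that are missing from the data in either year are dropped by `percent_changes`, since no finite percentage can be computed for them; that gap is one angle on question 5. Using each name's share of recorded births instead of raw counts would be a reasonable alternative definition.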


B) Onward to Insight!

What insight can you extract from this dataset? Feel free to combine the baby
names data with other publicly available datasets or APIs, but be sure to include
code for accessing any alternative data that you use.

This is an open-ended question and you are free to answer as you see fit. In
fact, we would love it if you find an interesting way to look at the data that
we haven't thought of!

Please provide us with both your code and an informative write-up of your
results. The code should be in a runnable form. Do not assume that we have a
copy of the data set or that we are familiar with the build procedures for your
chosen language.  



Posted on 5-4-2016 04:37 PM from the 美国米群网 mobile version
Thanks, 李成蹊, for sharing!
Posted on 5-5-2016 04:19 AM from the 美国米群网 mobile version
Great post from the original poster, 李成蹊!