An interview with "Countable Care" winner Gilberto Titericz Jr.

We got a chance to hear from Gilberto Titericz Jr., our Countable Care 1st place finisher. He answered some of our questions about himself, the competition, and data science in general.

Congratulations on winning the Countable Care competition! Who are you and what do you do?

I'm an electronics engineer with an M.S. in telecommunications. For the past 16 years I've worked as an engineer for big multinationals like Siemens and Nokia, and later as an automation engineer for Petrobras Brasil.

What first got you interested in data science?

In 2008 I started teaching myself machine learning techniques, at first to try to improve my personal stock market gains. In 2012 I started competing in international machine learning competitions and have learned a lot more since then.

How long have you been at Petrobras, what kind of problems do you work on, and what were you doing before?

I've been working for Petrobras since 2008. My current working day includes analyzing and optimizing automation logic of heavy equipment.

What brought you to DrivenData or this problem in particular?

I try to join every data science competition I can. After placing 3rd in DrivenData's previous competition, I just waited for a new one, and when "Countable Care" was released I quickly joined and started coding. The dataset is small but very good to work on.

Can you describe your initial approach to this problem? When you’re first exploring a new data set, what are your personal habits for getting familiar with the data?

My approach is based on an ensemble of models. To make that possible, I trained a lot of models with different hyperparameters, different numbers of features, and different learning techniques. That was the first level of learning.

All training used 4-fold cross-validation, which produces two sets of predictions for each model: cross-validated predictions on the training set and predictions on the test set. The models are then ensembled by a second level of learning: a final model is trained on all the cross-validated training predictions (the meta features) and then applied to the test-set predictions.
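The two-level scheme described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Gilberto's actual code: the function names (`kfold_indices`, `stack_level_one`, `fit_mean`, `fit_nearest`, `fit_blend`) are hypothetical, the two toy learners stand in for the real Random Forest, XGBoost, and Logistic Regression models, the data is made up, and averaging the fold models' test predictions is one common choice that the interview does not confirm.

```python
import random

def kfold_indices(n, k, seed=0):
    """Split range(n) into k shuffled, nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def stack_level_one(fitters, X, y, X_test, k=4):
    """Build meta features: one column per base model.

    train_meta[i][m] is model m's prediction for training row i, made by a
    copy of the model that never saw row i (out-of-fold prediction).
    test_meta[j][m] averages the k fold models' predictions for test row j
    (an assumption; the interview does not say how test predictions were made).
    """
    folds = kfold_indices(len(X), k)
    train_meta = [[0.0] * len(fitters) for _ in X]
    test_meta = [[0.0] * len(fitters) for _ in X_test]
    for m, fit in enumerate(fitters):
        for held_out in folds:
            held = set(held_out)
            rows = [i for i in range(len(X)) if i not in held]
            predict = fit([X[i] for i in rows], [y[i] for i in rows])
            for i in held_out:
                train_meta[i][m] = predict(X[i])
            for j, row in enumerate(X_test):
                test_meta[j][m] += predict(row) / k
    return train_meta, test_meta

# Toy base learners (stand-ins for the real first-level models).
def fit_mean(X, y):
    mean = sum(y) / len(y)
    return lambda row: mean

def fit_nearest(X, y):
    def predict(row):
        dist = lambda i: sum((a - b) ** 2 for a, b in zip(X[i], row))
        return y[min(range(len(X)), key=dist)]
    return predict

def fit_blend(M, y, steps=20):
    """Toy second-level learner: grid-search the weight blending two columns."""
    def sse(w):
        return sum((w * a + (1 - w) * b - t) ** 2 for (a, b), t in zip(M, y))
    w = min((s / steps for s in range(steps + 1)), key=sse)
    return lambda row: w * row[0] + (1 - w) * row[1]

# Made-up data: one feature, label is 1.0 when the feature is >= 0.5.
X = [[i / 20] for i in range(20)]
y = [1.0 if row[0] >= 0.5 else 0.0 for row in X]
X_test = [[0.1], [0.9]]

train_meta, test_meta = stack_level_one([fit_mean, fit_nearest], X, y, X_test)
final = fit_blend(train_meta, y)          # second level trains on meta features
preds = [final(row) for row in test_meta]  # then is applied to test meta features
```

The key property is that the second-level model only ever sees predictions made on rows the base model was not trained on, which keeps the meta features from leaking the training labels.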

That approach was chosen because the dataset is composed of 1,379 features, most of them categorical, and there are not enough training instances to do reliable feature selection. Since the predictive power of the dataset comes mainly from the categorical features, which don't have many levels, I decided to drop simple feature selection and use the ensemble technique to better explore interactions between all the features and different learning techniques. My final ensemble was composed of 9 models:

1 Random Forest
6 XGBoost
2 Logistic Regression

When first exploring a new dataset I always visualize the data, try to identify the categorical and numeric features, and look at how the features correlate with each other and with the target. Then I run a single cross-validated training to see what performance the dataset can achieve.
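As a rough illustration of that first look, the sketch below labels each column as numeric or categorical and reports either its correlation with the target or its number of levels. The helper names (`is_numeric`, `pearson`, `profile`) and the toy data are hypothetical, and the "parses as a float" rule is a simplifying assumption: integer-coded categorical columns would be mislabeled as numeric and need a manual check.

```python
from math import sqrt

def is_numeric(values):
    """Treat a column as numeric if every value parses as a float.

    Assumption for this sketch only: integer-coded categories will
    also pass this test and should be reviewed by hand.
    """
    try:
        [float(v) for v in values]
        return True
    except (TypeError, ValueError):
        return False

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def profile(columns, target):
    """Per column: ("numeric", correlation with target) or ("categorical", level count)."""
    report = {}
    for name, values in columns.items():
        if is_numeric(values):
            nums = [float(v) for v in values]
            report[name] = ("numeric", round(pearson(nums, target), 3))
        else:
            report[name] = ("categorical", len(set(values)))
    return report

# Made-up toy data, not the Countable Care dataset.
columns = {
    "age": [21, 35, 48, 52, 60, 33],
    "region": ["a", "b", "a", "c", "b", "a"],
}
target = [0, 0, 1, 1, 1, 0]
report = profile(columns, target)
```

A report like this is only a starting point; the single cross-validated baseline run mentioned above is what indicates how much signal the dataset actually holds.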

Was there anything that surprised you in working on this particular problem? Anything that worked particularly well (or not well) that you weren’t necessarily expecting?

Ensembling models at a second level worked very well in this competition. One thing that did not work well was feature selection: probably because of the size of the dataset, selecting good features just led to overfitting.

What programming languages do you like to use for data science tasks? What are your thoughts on the current and future scientific computing ecosystem?

I use R 70% of the time and Python 30%. Machine learning tools, services, and products are more in demand in the world market every day. So there is a great future for data science, because the scientific environment will grow even bigger in the next few years.

And a talented data scientist equipped with high-end software and hardware can be a company's secret weapon.

What machine learning research interests you right now? Have you read any good papers lately?

I like ensembling and feature selection techniques. I have also started learning about convolutional networks and have read some articles about them. But they need good hardware, so buying a PC with a good GPU is my next step.

It seems like data science in particular has an incredible number of self-directed learners and people with nontraditional career paths. Do you have any recommendations for people trying to break into the field and learn the sort of applied skills you used in this competition?

There is no end to the data science learning curve. You are always learning something new. I have found that joining competitions is the best way to build strong knowledge in the field, mostly because it obliges you to read papers, articles, and forums about a specific problem. It also makes you think about new possibilities and try different approaches to the problems. So... join a competition and learn more and more!

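Now some just for fun questions about your setup (hat tip to usesthis.com!) -- in your daily personal and working life, what hardware do you use? Would you mind sharing a picture of your hardware setup/desk?

My home hardware is a mid-range Toshiba laptop equipped with a 3rd-generation Core i7, a 120 GB SSD main disk, and 24 GB of RAM.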

My work hardware is a low-end Core i5 with 4 GB of RAM :-((

And what software?

My laptop smoothly runs Ubuntu 14.04 LTS, RStudio Server, and the Spyder Python IDE. Those are the basics, and they are enough.

Would you mind sharing a screenshot of your desktop?

Here it is:

A big thanks goes out to Gilberto from the DrivenData team for taking the time to chat. Feel free to discuss this interview in the DrivenData community forum, and stay tuned for the conclusion of our Keeping it Fresh competition!
