
An interview with "Countable Care" winner Gilberto Titericz Jr.

We got a chance to hear from Gilberto Titericz Jr., our Countable Care 1st place finisher. He answered some of our questions about himself, the competition, and data science in general.

Congratulations on winning the Countable Care competition! Who are you and what do you do?

I'm an electronics engineer with an M.S. in telecommunications. For the past 16 years I've worked as an engineer for big multinationals like Siemens and Nokia, and later as an automation engineer for Petrobras Brasil.

What first got you interested in data science?

In 2008 I started teaching myself machine learning techniques, at first to improve my personal stock market gains. In 2012 I started competing in international machine learning competitions and have learned a lot more since then.

How long have you been at Petrobras, what kind of problems do you work on, and what were you doing before?

I've been working for Petrobras since 2008. My current working day includes analyzing and optimizing the automation logic of heavy equipment.

What brought you to DrivenData or this problem in particular?

I try to join every data science competition I can. After placing 3rd in DrivenData's previous competition, I waited for a new one, and when "Countable Care" was released I quickly joined and started coding. The dataset is small but very good to work with.

Can you describe your initial approach to this problem? When you’re first exploring a new data set, what are your personal habits for getting familiar with the data?

My approach is based on an ensemble of models. To make it possible, I trained a lot of models, with different hyperparameters, different numbers of features, and different learning techniques. That was the first level of learning.

All training used 4-fold cross-validation, which generates two prediction sets for each model: one cross-validated set of predictions on the training data and one set of predictions on the test data. All those models are then ensembled using a second level of learning: a final model is trained on all the cross-validated training predictions (meta features) and then applied to the test predictions.
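Editor's sketch: the two-level scheme described above can be illustrated in a few lines of plain Python. This is not Gilberto's code. The helper names (`make_folds`, `out_of_fold`, `level_rate_model`, `fit_logistic`) and the toy data are hypothetical, the first-level learners are simple stand-ins for his random forest, XGBoost and logistic regression models, and averaging the fold models to get the test predictions is an assumption (the interview does not say how they were produced).

```python
import math
import random


def make_folds(n_rows, n_folds=4, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]


def out_of_fold(fit, X, y, X_test, folds):
    """Build one model's two prediction sets: cross-validated train and test."""
    oof = [0.0] * len(X)
    test_pred = [0.0] * len(X_test)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(X[i])  # row i was not used to fit this model
        for j, row in enumerate(X_test):
            test_pred[j] += predict(row) / len(folds)  # assumption: average the fold models
    return oof, test_pred


def level_rate_model(col):
    """Hypothetical first-level learner: mean target per level of one categorical column."""
    def fit(X, y):
        prior = sum(y) / len(y)
        totals, counts = {}, {}
        for row, target in zip(X, y):
            totals[row[col]] = totals.get(row[col], 0.0) + target
            counts[row[col]] = counts.get(row[col], 0) + 1
        rates = {level: totals[level] / counts[level] for level in counts}
        return lambda row: rates.get(row[col], prior)  # unseen level falls back to the prior
    return fit


def fit_logistic(M, y, lr=0.5, epochs=300):
    """Tiny gradient-descent logistic regression used as the second-level learner."""
    rows = [list(r) + [1.0] for r in M]  # trailing 1.0 is the bias input
    w = [0.0] * len(rows[0])

    def prob(row):
        z = sum(wi * xi for wi, xi in zip(w, row))
        return 1.0 / (1.0 + math.exp(-z))

    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(rows, y):
            err = prob(row) - target
            for k, xk in enumerate(row):
                grad[k] += err * xk
        for k in range(len(w)):
            w[k] -= lr * grad[k] / len(rows)
    return lambda row: prob(list(row) + [1.0])


# Toy categorical data (made up): the target depends mostly on column 0.
rng = random.Random(1)
X = [[rng.choice("abc") for _ in range(3)] for _ in range(200)]
y = [1 if (row[0] == "a") != (rng.random() < 0.2) else 0 for row in X]
X_test = [[rng.choice("abc") for _ in range(3)] for _ in range(50)]

folds = make_folds(len(X), n_folds=4)
first_level = [level_rate_model(col) for col in range(3)]
results = [out_of_fold(fit, X, y, X_test, folds) for fit in first_level]

# Second level: one meta feature per first-level model.
meta_train = [list(r) for r in zip(*(oof for oof, _ in results))]
meta_test = [list(r) for r in zip(*(tp for _, tp in results))]
stacker = fit_logistic(meta_train, y)
final = [stacker(row) for row in meta_test]
```

The important detail is that every meta feature for a training row comes from a model that never saw that row, so the second-level learner is not trained on leaked predictions.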

That approach was chosen because the dataset is composed of 1,379 features, most of them categorical, and there are not enough training instances to do reliable feature selection. As the predictive power of the dataset comes mainly from the categorical features, which don't have many levels, I decided to skip simple feature selection and use the ensemble technique to better explore the interactions between all features and the different learning techniques. My final ensemble was composed of 9 models:

  • 1 Random Forest
  • 6 XGBoost
  • 2 Logistic Regression

When first exploring a new dataset I always visualize the data, try to identify categorical and numeric features, check how the features correlate with each other and with the target, and then run a single cross-validated training to see what performance the dataset can achieve.
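Editor's sketch: a rough illustration of two of those first-look steps, telling categorical from numeric columns and checking correlation with the target. The helpers `column_kinds` and `pearson`, the `max_levels` threshold and the toy rows are all hypothetical. The visualization and cross-validated baseline steps are left out.

```python
import math


def column_kinds(rows, max_levels=10):
    """Label each column 'numeric' or 'categorical' with a simple heuristic.

    Assumption: a column is numeric only if every value is a number and it
    has more than max_levels distinct values; everything else is categorical.
    """
    kinds = []
    for col in zip(*rows):
        is_number = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in col
        )
        many_levels = len(set(col)) > max_levels
        kinds.append("numeric" if is_number and many_levels else "categorical")
    return kinds


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Toy rows (made up): a numeric id-like column, a string flag, a 3-level code.
rows = [(i, "a" if i % 2 else "b", i % 3) for i in range(30)]
target = [1 if i >= 15 else 0 for i in range(30)]

kinds = column_kinds(rows)
numeric_cols = [k for k, kind in enumerate(kinds) if kind == "numeric"]
corr = {k: pearson([r[k] for r in rows], target) for k in numeric_cols}
```

Here the integer column with only three distinct values is treated as categorical, which matches the habit of looking at the number of levels rather than the storage type.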

Was there anything that surprised you in working on this particular problem? Anything that worked particularly well (or not well) that you weren’t necessarily expecting?

Ensembling models at a second level worked very well in this competition. One thing that did not work well was feature selection: probably because of the size of the dataset, selecting good features just led to overfitting.

What programming languages do you like to use for data science tasks? What are your thoughts on the current and future scientific computing ecosystem?

I use R 70% of the time and Python 30%. Machine learning tools, services and products are more in demand in the world market every day. So there is a great future for data science, because the scientific environment will grow even bigger in the next few years.

And a company with a talented Data Scientist equipped with high end software and hardware can be a secret weapon.

What machine learning research interests you right now? Have you read any good papers lately?

I like ensembling and feature selection techniques. I have also started learning about convolutional networks and read some articles about them. They need good hardware, though, so buying a PC with a good GPU is my next step.

It seems like data science in particular has an incredible number of self-directed learners and people with nontraditional career paths. Do you have any recommendations for people trying to break into the field and learn the sort of applied skills you used in this competition?

There is no end to the data science learning curve. You are always learning something new. I found that joining competitions is the best way to build strong knowledge in the field, mostly because it forces you to read papers, articles and forums about a specific problem. It also makes you think about new possibilities and try different approaches to the problems. So... join a competition and learn more and more!

Now some just for fun questions about your setup (hat tip to usesthis.com!) -- in your daily personal and working life, what hardware do you use? Would you mind sharing a picture of your hardware setup/desk?

My home hardware is a mid-range Toshiba laptop equipped with a third-generation Core i7, a 120 GB SSD main disk, and 24 GB of RAM.

My work hardware is a low-end Core i5 with 4 GB of RAM :-((

And what software?

My laptop runs Ubuntu 14.04 LTS, RStudio Server and the Spyder Python IDE smoothly. Those are the basics, and they are enough.

Would you mind sharing a screenshot of your desktop?

Here it is: [desktop screenshot]

A big thanks goes out to Gilberto from the DrivenData team for taking the time to chat. Feel free to discuss this interview in the DrivenData community forum, and stay tuned for the conclusion of our Keeping it Fresh competition!
