Introduction to Text Analytics with R Part 1 | Overview

This data science series introduces the viewer to the exciting world of text analytics with R programming. As exemplified by the popularity of blogging and social media, textual data is far from dead; it is increasing exponentially! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products. This data science training provides introductory coverage of the following tools and techniques:

– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating accuracy of the trained classification models
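
As a rough illustration of the first three items above, here is a minimal base-R sketch of tokenization, a bag-of-words matrix, and cosine similarity between documents. It is not the code used in the series; the toy documents and the helper names (tokenize, cosine, dtm) are made up for this example.

```r
# Toy documents (hypothetical, not taken from the SMS spam dataset).
docs <- c(
  d1 = "Free prize! Claim your free prize now",
  d2 = "Claim your prize now",
  d3 = "Are we still meeting for lunch today"
)

# Tokenization: lower-case, replace non-letters with spaces, split on whitespace.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens[tokens != ""]
}

tokens <- lapply(docs, tokenize)

# Bag-of-words: one row per document, one column per vocabulary term,
# each cell holding how often the term occurs in the document.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Cosine similarity between two term-count vectors.
cosine <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_12 <- cosine(dtm["d1", ], dtm["d2", ])  # documents sharing several terms
sim_13 <- cosine(dtm["d1", ], dtm["d3", ])  # documents sharing no terms
```

Documents with overlapping vocabulary score closer to 1, while documents with no terms in common score 0. Dedicated packages handle stemming, n-grams, and large corpora far more efficiently than this sketch.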

The overview video introduces text analytics as a whole and outlines what to expect from the rest of the series. It also includes specific coverage of:

– Overview of the spam dataset used throughout the series
– Loading the data and initial data cleaning
– Some initial data analysis, feature engineering, and data visualization
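
The steps above might look roughly like the following base-R sketch. It uses a tiny inline data frame in place of the downloaded CSV so that it runs on its own. The raw column names v1 and v2 and the new name Label are assumptions; spam.raw, Text, and TextLength follow the names quoted in the comments further down.

```r
# Stand-in for the downloaded CSV (assumed columns: v1 = label, v2 = message).
spam.raw <- data.frame(
  v1 = c("ham", "spam", "ham", "spam"),
  v2 = c("See you at lunch",
         "WINNER!! Claim your free prize now by calling",
         "Ok",
         "Free entry in a weekly competition, text WIN to enter"),
  stringsAsFactors = FALSE
)

# Initial cleaning: readable column names and a factor for the class label.
names(spam.raw) <- c("Label", "Text")
spam.raw$Label <- as.factor(spam.raw$Label)

# Check for rows with missing values.
n_incomplete <- sum(!complete.cases(spam.raw))

# Initial analysis: share of each class.
label_share <- prop.table(table(spam.raw$Label))

# Feature engineering: message length in characters.
spam.raw$TextLength <- nchar(spam.raw$Text)

# Compare the average length of ham and spam messages.
mean_length <- tapply(spam.raw$TextLength, spam.raw$Label, mean)
```

With the real file, the data frame would come from read.csv() instead, and a histogram of TextLength split by Label is a natural first visualization.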

Kaggle Dataset:
https://www.kaggle.com/uciml/sms-spam-collection-dataset

The data and R code used in this series are available here:
https://code.datasciencedojo.com/datasciencedojo/tutorials/tree/master/Introduction%20to%20Text%20Analytics%20with%20R

Learn more about Data Science Dojo here:
https://datasciencedojo.com/data-science-bootcamp/

Watch the latest video tutorials here:
https://tutorials.datasciencedojo.com/

See what our past attendees are saying here:
https://datasciencedojo.com/bootcamp/reviews/#videos

#rprogramming #textanalytics #rtutorial

Comments

Opal Cross Coaching says:

This is great content on text mining in R. I also have a channel that discusses text mining in R on data from the web, PDF documents and data frames.

163ii says:

Well articulated and clear. Thanks so much for this video.

Nnenna Umelloh says:

This is great! Thank you!

kebman says:

I've always been curious about the usage of Neo4j and graph databases in conjunction with text analytics. Of course, I'm a complete noob in this field, but it nevertheless fascinates me. So how would you do that?

kebman says:

Lol I have never used R. Let's hope for the best, guys! <3 (I do use multiple other languages tho, and even Perl, so I'm gonna give this my best shot lol!) Edit: I already think math's gonna be a problem lol. Halp! I'm in way over my head now guys!!! xD

Sophie J says:

Thank you so much! If anyone runs into errors replicating the lecture, use the code and dataset the speaker uploaded. The link is above.

Matthew Graham says:

HOW TO GET AROUND ERROR PRESENTED AT ~ @24:00

spam.raw$TextLength <- nchar(x = spam.raw$Text, type = "chars", allowNA = TRUE)

pay attention to that last argument, the "allowNA = TRUE" is what makes it all work. Check the documentation by typing ?nchar() into the console.

NOTE: doing this will give you different results when you use the summary() function on this new feature. That is simply because you're allowing NAs to be counted (I think?). It doesn't change that much though from what's in the video. If someone has a better fix or idea on how to deal with this please comment below with your reply.

You're welcome 🙂

seaman sun says:

When I run the code

spam.raw$TextLength <- nchar(spam.raw$Text)
summary(spam.raw$TextLength)

I get the error "Error in nchar(spam.raw$Text) : invalid multibyte string, element 634".

Can anyone help me? Thanks very much.
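
One possible way around the "invalid multibyte string" error quoted above, sketched in base R. The error usually means that some element contains bytes that are not valid in the session's encoding. The example vector is hypothetical, and treating the stray bytes as Latin-1 is an assumption; the real file may use a different encoding.

```r
# Hypothetical messages; the second contains a stray Latin-1 byte (0xE9),
# which is not valid UTF-8 on its own.
x <- c("plain ascii text", "caf\xe9 voucher")

# Locate the elements that are not valid UTF-8.
bad <- which(!validUTF8(x))

# Option A: count bytes instead of characters. This does not fail on
# invalid strings, but differs from the character count for non-ASCII text.
len_bytes <- nchar(x, type = "bytes")

# Option B: re-encode to UTF-8 first (assuming the source is Latin-1),
# then count characters as usual.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
len_chars <- nchar(x_utf8, type = "chars")
```

By comparison, allowNA = TRUE makes nchar() return NA for the invalid elements instead of stopping, which is why summary() then reports NA values. If the encoding of the file is known, passing it to read.csv() through fileEncoding avoids the problem at the source.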

Shawn Huang says:

Great work, thanks David for the wonderful explanation.

Vansh Jauhari says:

Hi Dave, my data is showing 2 missing values??

Aditya Raj says:

hey @Dave getting this as error
package or namespace load failed for ‘quanteda’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):

namespace ‘rlang’ 0.4.2 is already loaded, but >= 0.4.3 is required
please help me out here please

Ahmad Alamer says:

Thank you for your contribution for the world David! you are amazing and your parents are proud of you.

مشاعل says:

What if I want to exclude some stop words from the stop_words() list, how can I do it? I tried to make custom stopwords but it didn't work.
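
A minimal base-R sketch of one way to approach this: build a custom stop list by removing the words you want to keep from the default list, then filter the tokens against it. The short default_stops vector is a made-up stand-in for whatever list a package provides.

```r
# Hypothetical default stop list; in practice this would come from a package.
default_stops <- c("the", "is", "not", "a", "to", "and")

# Words that should stay in the text even though they are on the default list.
keep <- c("not")

# Custom stop list: the default list minus the words to keep.
custom_stops <- setdiff(default_stops, keep)

# Remove only the custom stop words from a tokenized message.
tokens <- c("this", "is", "not", "a", "scam")
filtered <- tokens[!tokens %in% custom_stops]
```

With quanteda, the same custom character vector can presumably be passed to tokens_remove() in place of the default stopwords() result.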

slk slk says:

To the MALAYALI data scientist who noticed something at 23:42

مشاعل says:

Hi.
what should i learn first? natural language processing or text analysis?
