In this lesson, we will learn how to train a Naive Bayes classifier or a Logistic Regression classifier - basic machine learning algorithms - in order to classify text into categories.
Should I split documents into single sentences or use them as is to train a text classification model? I was wondering what's the best way to feed the model with training data.
Can I just use the documents as is? Like this:
{"phrase": "First long document with up to 30 sentences", "result": {"label 1": 1}},
{"phrase": "first long document with up to 30 sentences", "result": {"label 2": 1}},
{"phrase": "Second long document with up to 30 sentences", "result": {"label 2": 1}}, etc.
Or should I split all documents into sentences, so that the data looks something like this:
{"phrase": "Sentence 1 out of document 1", "result": {"label 1": 1}},
{"phrase": "Sentence 2 out of document 1", "result": {"label 2": 1}}, etc.
{"phrase": "Sentence 1 out of document 2", "result": {"label 5": 1}}, etc.
{"phrase": "Sentence X out of document X", "result": {"No labels at all": 1}}, etc.
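To make the sentence-level option concrete, here is a minimal sketch of how document-level records in the format above could be turned into sentence-level ones. It rests on two assumptions that are not part of my actual setup: the helper name `split_into_sentence_records` is made up, the regex splitter is deliberately naive, and every sentence simply inherits the labels of its parent document (giving sentences their own labels, as in the example above, would need sentence-level annotation).

```python
import re

def split_into_sentence_records(record):
    """Turn one document-level record into one record per sentence.

    Assumption: each sentence inherits the labels of its parent document.
    """
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", record["phrase"].strip())
    return [
        {"phrase": sentence, "result": dict(record["result"])}
        for sentence in sentences
        if sentence
    ]

# Made-up example document, only for illustration.
doc = {
    "phrase": "The battery died after a week. Support never replied.",
    "result": {"label 1": 1},
}
sentence_records = split_into_sentence_records(doc)
for record in sentence_records:
    print(record)
```

A proper sentence tokenizer would handle abbreviations and similar cases better than this regex.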
The same question applies to using the model: should I apply it to the complete document, or should I split the document into separate sentences and apply the model to each sentence?
What's the best practice?
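The two ways of using the model that I have in mind can be sketched as follows. Everything here is an illustration under assumptions: `toy_predict` and its `KEYWORDS` table are a hypothetical stand-in for a trained classifier, and averaging the per-sentence scores is just one possible way to combine them.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a trained model: scores a text per label by
# counting keyword hits. A real classifier would replace this function.
KEYWORDS = {
    "label 1": {"battery", "charger"},
    "label 2": {"support", "refund"},
}

def toy_predict(text):
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {label: float(len(words & kws)) for label, kws in KEYWORDS.items()}

def predict_by_sentence(document):
    """Score each sentence separately, then average the scores per label."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    totals = defaultdict(float)
    for sentence in sentences:
        for label, score in toy_predict(sentence).items():
            totals[label] += score
    return {label: total / len(sentences) for label, total in totals.items()}

document = "The battery died after a week. Support never replied. I want a refund."
whole_scores = toy_predict(document)             # option 1: whole document
sentence_scores = predict_by_sentence(document)  # option 2: per sentence, averaged
print(whole_scores)
print(sentence_scores)
```

With the sentence-level variant, the per-sentence scores could also be kept separate instead of averaged, if the goal is to find which part of a document triggered a label.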
Also, how do I approach classification into multiple categories?
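If "multiple categories" means that one document can carry several labels at once (an assumption; it could also mean exactly one label out of many), the duplicated records from my first example could be merged into a single multi-label target per document. A rough sketch, with shortened made-up phrases:

```python
from collections import defaultdict

records = [
    {"phrase": "First long document", "result": {"label 1": 1}},
    {"phrase": "First long document", "result": {"label 2": 1}},
    {"phrase": "Second long document", "result": {"label 2": 1}},
]

# Collect every label that is switched on for the same text.
labels_by_phrase = defaultdict(set)
for record in records:
    for label, value in record["result"].items():
        if value:
            labels_by_phrase[record["phrase"]].add(label)

# Fixed label order so that every document gets a comparable 0/1 vector.
all_labels = sorted({label for labels in labels_by_phrase.values() for label in labels})
multi_hot = {
    phrase: [1 if label in labels else 0 for label in all_labels]
    for phrase, labels in labels_by_phrase.items()
}
print(all_labels)
print(multi_hot)
```

Each 0/1 vector could then serve as the target for one binary classifier per label, which is one common way to set up multi-label classification.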