# Scikit-learn

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
twenty_train = fetch_20newsgroups(
    subset="train",
    categories=['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'],
    shuffle=True, random_state=42)

In [3]:
type(twenty_train)

sklearn.utils.Bunch

In [4]:
isinstance(twenty_train, dict)

True

In [5]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
print(twenty_train.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [7]:
len(twenty_train.data)

2257

In [8]:
print(twenty_train.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [9]:
print(twenty_train.filenames[0])

/Users/nickie/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38440


In [11]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [12]:
print(twenty_train.target[0])

1


In [13]:
twenty_test = fetch_20newsgroups(
    subset="test",
    categories=['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'],
    shuffle=True, random_state=42)

In [14]:
len(twenty_test.data)

1502

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

The [`CountVectorizer` object](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents into a sparse matrix of token counts.

In [16]:
v = CountVectorizer()

In [17]:
tdm = v.fit_transform([
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
])

In [18]:
# Deprecated in scikit-learn 1.0 and removed in 1.2; newer versions use v.get_feature_names_out().
v.get_feature_names()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [19]:
type(tdm)

scipy.sparse.csr.csr_matrix

In [20]:
tdm.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
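
The matrix above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the `CountVectorizer` defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters). The names `corpus`, `tokenize`, `vocabulary` and `counts` are illustrative and not part of scikit-learn.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring the default CountVectorizer token pattern (an assumption
    # about the defaults, not a call into scikit-learn).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


# The vocabulary is the sorted set of all tokens; a term's index is its column.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary term.
counts = [[tokenize(doc).count(term) for term in vocabulary] for doc in corpus]

print(vocabulary)
for row in counts:
    print(row)
```

Each row counts how often every vocabulary term occurs in one document, which is why `document` has a count of 2 in the second row.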

The [`TfidfTransformer` object](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) transforms a count matrix into a normalized tf-idf representation.

In [21]:
t = TfidfTransformer()

In [22]:
tf_idf = t.fit_transform(tdm)

In [23]:
t.idf_

array([1.91629073, 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.91629073, 1.        , 1.91629073, 1.        ])
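
These values are consistent with the smoothed inverse document frequency that `TfidfTransformer` applies by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A minimal sketch in plain Python, with the count matrix copied from the earlier output; `counts`, `document_frequency` and `idf` are illustrative names.

```python
import math

# Term counts for the four example documents (rows) over the nine terms (columns).
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

n_documents = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents in which each term appears.
document_frequency = [
    sum(1 for row in counts if row[j] > 0) for j in range(n_terms)
]

# Smoothed idf: adding 1 to numerator and denominator avoids division by zero,
# and the trailing "+ 1" keeps terms that occur in every document from vanishing.
idf = [math.log((1 + n_documents) / (1 + df)) + 1 for df in document_frequency]

print([round(value, 8) for value in idf])
```

Terms that occur in all four documents (`is`, `the`, `this`) get the minimum weight of 1, while terms that occur in a single document get the largest weight.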

In [24]:
type(tf_idf)

scipy.sparse.csr.csr_matrix

In [25]:
tf_idf.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
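
Each row of this matrix is the term counts multiplied by the idf weights and then scaled to unit Euclidean length, assuming the `TfidfTransformer` defaults (`norm='l2'`, `use_idf=True`, `smooth_idf=True`). The sketch below recomputes it in plain Python; `counts`, `idf` and `tfidf_row` are illustrative names, and the count matrix is copied from the earlier output.

```python
import math

# Term counts for the four example documents.
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

n_documents = len(counts)

# Smoothed idf per term: ln((1 + n) / (1 + df)) + 1.
idf = [
    math.log((1 + n_documents) / (1 + sum(1 for row in counts if row[j] > 0))) + 1
    for j in range(len(counts[0]))
]


def tfidf_row(row):
    # Weight raw counts by idf, then divide by the row's L2 norm.
    weighted = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted]


tf_idf = [tfidf_row(row) for row in counts]

for row in tf_idf:
    print([round(value, 8) for value in row])
```

Because of the normalization, every row has length 1, so documents of different lengths become directly comparable.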

Build a pipeline that chains `CountVectorizer` and `TfidfTransformer`, with an [`SGDClassifier` object](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) as the third step.

In [26]:
text_classify_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2'))
])

In [27]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

In [28]:
from sklearn.model_selection import GridSearchCV

In [29]:
gs_clf = GridSearchCV(text_classify_pipeline, parameters, cv=5, n_jobs=-1)
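
Each key in the grid has the form `step__parameter`, so `vect__ngram_range` sets `ngram_range` on the pipeline step named `vect`. The grid search evaluates every combination of the listed values. As a rough sketch of what that means here, the plain-Python snippet below enumerates the combinations with `itertools.product`; it illustrates the size of the search and is not how `GridSearchCV` is implemented. The names `candidates` and `cv_folds` are illustrative.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Build one dict per combination of parameter values.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

for candidate in candidates:
    print(candidate)

# Every candidate is fitted once per cross-validation fold.
print(len(candidates), "candidates,", len(candidates) * cv_folds, "cross-validation fits")
```

With two values for each of the three parameters there are 8 candidates, hence 40 cross-validation fits, plus one final refit of the best candidate on the full training set when `refit` is left at its default.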

In [30]:
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [31]:
gs_clf.best_score_

0.9672131147540983

In [32]:
for param in sorted(parameters.keys()):
    print("{}: {}".format(param, gs_clf.best_params_[param]))

clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)


In [33]:
final_pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1,1))),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=0.001))
])

In [34]:
final_pipeline = final_pipeline.fit(twenty_train.data, twenty_train.target)

In [35]:
final_pipeline.score(twenty_test.data, twenty_test.target)

0.9114513981358189