Data Knowledge and Engineering Major Assignment

Mochamad Taufik Pratama-1103130243

This assignment is about extracting information from articles. Ten articles are used, all taken from https://www.newsinlevels.com

The source code is explained below.
To extract the information, the following library imports are needed:


from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

Next, define the function that collects the named entities from the text:

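def get_continuous_chunks(text):
    # Tokenize, POS-tag, then chunk the text into named-entity subtrees
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # unique entity names, in order of appearance
    current_chunk = []     # parts of the entity currently being collected

    for i in chunked:
        if isinstance(i, Tree):
            # Adjacent entity subtrees are joined into one name
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # An ordinary token ends the current entity
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Always reset, otherwise a repeated entity leaks into the next one
            current_chunk = []

    # Flush an entity that ends the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk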
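
To make the grouping idea easier to follow without installing NLTK, here is a minimal sketch that runs the same kind of loop over a hand-made stand-in for the ne_chunk output. The stand-in format (a list for an entity subtree, a tuple for an ordinary token) and the name group_entities are assumptions made for illustration; they are not part of NLTK or of the original code.

```python
def group_entities(chunked):
    """Join adjacent entity subtrees into one name and drop duplicates.

    Stand-in input format (an assumption, not NLTK's Tree type):
    a list of (token, tag) pairs is an entity subtree, a single
    (token, tag) tuple is an ordinary token.
    """
    entities = []
    current = []
    for item in chunked:
        if isinstance(item, list):      # entity subtree
            current.append(" ".join(token for token, tag in item))
        elif current:                   # ordinary token ends the run
            name = " ".join(current)
            if name not in entities:
                entities.append(name)
            current = []                # reset even when the name is a duplicate
    if current:                         # flush an entity at the very end
        name = " ".join(current)
        if name not in entities:
            entities.append(name)
    return entities


sample = [
    [("Prince", "NNP")], [("George", "NNP")],  # two adjacent subtrees
    ("visits", "VBZ"),
    [("Nepal", "NNP")],
    ("and", "CC"),
    [("Nepal", "NNP")],                        # duplicate, kept once
]
print(group_entities(sample))  # ['Prince George', 'Nepal']
```

The two adjacent subtrees are merged into a single name, and the repeated entity is reported only once.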

# blank lines separate the news articles and their paragraphs

berita = '''
A photo of Britain's Prince George has made animal rights groups angry. The photo was for the prince's third birthday. 
It shows him offering white chocolate ice cream to his pet dog. 
The charity, the Royal Society for the Prevention of Cruelty to Animals (RSPCA), said Prince George was trying to be kind to his dog, but it wasn't a good thing to do. 
It said chocolate and ice cream are bad for dogs. The RSPCA said it did not advise others to do the same as George.

North Korea tries to launch a missile. The USA reacts. It sends a warship to Korea. The situation is getting more and more serious.

A supply ship is going to the warship. It needs protection. A Japanese warship goes with it.

This story is about a man. He is from the USA. He is a university professor. He goes to Nepal. He climbs a mountain. The mountain is covered in ice.

He does not want to die. He climbs out of the hole. People find him the next day.

There is a video about a killer whale baby. The baby orca was born in Sea World park in Texas. Her name is Takara.

This is the last baby orca born in the park. It is the end of the breeding programme in Sea World. Some people worry about orcas which do not live in the open ocean. Activists want to move the mother and the baby into a wildlife reservation.

Researchers have a new way to film whales in Antarctica. The digital tags have a camera. They also get information on the whales.

The information tells us where the whales get their food. The researchers want to protect the whales. The cameras show us what the whales world looks like.

Here is some animal news. It is from a zoo in the USA. The zookeepers let some animals play.

They give the animals musical instruments. They want the animals to have some fun. The otters play the keyboard. An orangutan plays the xylophone.

Here is news from India. A leopard is on the roof of a house. People start to panic. They are scared of the animal. They move away from the leopard.

One man tries to get away. The animal attacks him! The leopard moves into the village. It hides in a small house. It is scared of the people, too. Nobody else is injured.

Here is some news from Norway. A man is skydiving. Something flies by. It looks like a black rock. The man thinks that it is a meteorite.

Most meteorites burn up when they enter the atmosphere. However, some meteorites survive. The man is working with scientists. They are trying to find the meteorite.

The people are in Costa Rica. They upload the video on YouTube recently.

The video is amazing. It shows a man. He feeds a crocodile. He does not hold the fish in his hand. He holds it in his mouth.

Here is some news from a Washington zoo. It is about little lions. They must pass a swimming test.

People throw them into water, and the animals must swim. All four cats hold their heads above the water. Three of them swim. One cat does not want to swim. It gets out of the water quickly.

Small cats must pass the test at the zoo. The test covers their ability to swim. Water must be no danger to these cats.
'''
print(get_continuous_chunks(berita))
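
Because the text in berita is divided by blank lines, the entities could also be extracted block by block instead of from the whole text at once. The sketch below only shows the splitting step; split_blocks is a hypothetical helper, not part of the original code, and in the text above some articles span more than one block.

```python
import re


def split_blocks(text):
    """Split text on blank lines and return the non-empty blocks."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block.strip() for block in blocks if block.strip()]


sample = '''
North Korea tries to launch a missile. The USA reacts.

A supply ship is going to the warship.

This story is about a man. He is from the USA.
'''

for number, block in enumerate(split_blocks(sample), start=1):
    print("%d: %s" % (number, block))
```

Each block could then be passed to get_continuous_chunks on its own, which keeps track of which entities appear together.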

Next, the print call above displays the result of the information extraction. The output obtained is:



The extraction result shows the important entity words found in the news. To find out which of these words are related to each other, a graph is needed that depicts the relations between the entities.
To build the entity relations, this assignment uses the plot.ly API, whose graph features are used to draw the relations. API access can be set up at https://plot.ly/

The first step is to import the following libraries:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import plotly.plotly as py
from plotly.graph_objs import *

Next, create the node and edge functions, which build the node markers and the lines between them:

def scatter_nodes(pos, labels=None, color=None, size=20, opacity=1):
    L=len(pos)
    trace = Scatter(x=[], y=[],  mode='markers', marker=Marker(size=[]))
    for k in range(L):
        trace['x'].append(pos[k][0])
        trace['y'].append(pos[k][1])
    attrib=dict(name='', text=labels , hoverinfo='text', opacity=opacity) # a dict of Plotly node attributes
    trace=dict(trace, **attrib)# concatenate the dict trace and attrib
    trace['marker']['size']=size
    return trace

def scatter_edges(G, pos, line_color=None, line_width=1):
    trace = Scatter(x=[], y=[], mode='lines')
    trace['hoverinfo'] = 'none'
    trace['line']['width'] = line_width
    if line_color is not None: # when it is None a default Plotly color is used
        trace['line']['color'] = line_color
    for edge in G.edges():
        # None ends each segment so that edges are not joined into one path
        trace['x'] += [pos[edge[0]][0], pos[edge[1]][0], None]
        trace['y'] += [pos[edge[0]][1], pos[edge[1]][1], None]
    return trace
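
The None entries in scatter_edges matter: Plotly breaks a line trace wherever a coordinate is None, so every edge is drawn as its own segment instead of one connected path. Here is a minimal, dependency-free sketch of the same flattening; the positions, the edges and the name edge_coordinates are made up for illustration.

```python
def edge_coordinates(edges, pos):
    """Flatten edges into x and y lists, with None after each segment."""
    xs, ys = [], []
    for start, end in edges:
        xs += [pos[start][0], pos[end][0], None]
        ys += [pos[start][1], pos[end][1], None]
    return xs, ys


pos = {0: (0.0, 0.0), 1: (1.0, 0.5), 2: (0.5, 1.0)}  # node -> (x, y)
edges = [(0, 1), (1, 2)]

xs, ys = edge_coordinates(edges, pos)
print(xs)  # [0.0, 1.0, None, 1.0, 0.5, None]
print(ys)  # [0.0, 0.5, None, 0.5, 1.0, None]
```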

After setting up the nodes and edges, the next step is an annotation function, which gives each node a name and keeps the labels from overlapping:

def make_annotations(pos, text, font_size=14, font_color='rgb(25,25,25)'):
    L=len(pos)
    if len(text)!=L:
        raise ValueError('The lists pos and text must have the same len')
    annotations = Annotations()
    for k in range(L):
        annotations.append(
            Annotation(
                text=text[k],
                x=pos[k][0], y=pos[k][1],
                xref='x1', yref='y1',
                font=dict(color= font_color, size=font_size),
                showarrow=False)
        )
    return annotations  

After the annotations, add the adjacency matrix, which defines the nodes and which of them are connected by edges:

Ad=np.array(  [[0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1], # Adjacency matrix
                         [1,0,0,0,0,1,0,1,0,1,0,0,1,1,0,1,1,0,0,1,0,0,0],
                         [1,0,0,0,1,0,1,1,0,1,0,0,1,0,0,0,1,1,0,0,1,0,0],
                         [0,0,0,0,1,0,1,1,0,1,0,1,0,1,1,0,0,0,1,0,0,1,0],
                         [1,0,0,0,1,1,0,1,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0],
                         [0,0,1,0,1,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,1],
                         [1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,1,0,0],
                      [1,0,0,0,1,1,1,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1],
                      [0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,1,0,0],
                      [0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,1,0,0,1,0,0,1,0],
                      [1,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,1,0,1],
                      [0,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,1,0,1,0,0,1],
                      [1,1,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,1],
                      [0,1,1,0,0,0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1,0,0],
                      [1,0,0,1,1,1,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1],
[0,1,1,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0],
                      [1,0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,0,0,0,1,0,1],
                      [1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,1,0],
                      [1,0,0,1,0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,1,0,0,1],
                      [0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,1,0,0,1], 
                      [1,1,0,0,0,1,1,0,0,1,0,1,0,1,0,0,0,1,0,1,0,1,0],
                      [1,1,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,1],
                         [1,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0]], dtype=float)
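
The post does not say how the matrix above was filled in. One possible way to derive such a matrix automatically, shown here purely as an assumption and not as the method used for this assignment, is to connect two entities whenever they appear in the same article. The entity lists below are made-up placeholders, and cooccurrence_matrix is a hypothetical helper.

```python
def cooccurrence_matrix(entities_per_article):
    """Return (names, matrix); matrix[i][j] is 1 if names i and j share an article."""
    # Collect every distinct entity in first-seen order.
    names = []
    for entities in entities_per_article:
        for name in entities:
            if name not in names:
                names.append(name)
    index = {name: k for k, name in enumerate(names)}

    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for entities in entities_per_article:
        for a in entities:
            for b in entities:
                if a != b:              # no self-loops on the diagonal
                    matrix[index[a]][index[b]] = 1
    return names, matrix


# Placeholder input: one list of entity names per article.
articles = [
    ["Britain", "Prince George", "RSPCA"],
    ["North Korea", "USA", "Korea"],
    ["USA", "Nepal"],
]

names, matrix = cooccurrence_matrix(articles)
for name, row in zip(names, matrix):
    print(name, row)
```

The nested lists can be wrapped in np.array(...) and used in place of Ad, with names serving as the node labels.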

This part builds the graph that represents the relations between the extracted words:
                                                 
# NetworkX 3.0 removed from_numpy_matrix; use nx.from_numpy_array(Ad) there
Gr=nx.from_numpy_matrix(Ad)
position=nx.spring_layout(Gr)
labels = get_continuous_chunks(berita) # the extracted entities are assigned to labels
traceE= scatter_edges(Gr, position)
traceN= scatter_nodes(position, labels=labels)

This part creates the layout for the graph:

width=500
height=500
axis=dict(showline=False, # hide axis line, grid, ticklabels and  title
          zeroline=False,
          showgrid=False,
          showticklabels=False,
          title=''
          )
layout=Layout(title= 'Relasi entitas berita',  #
    font= Font(),
    showlegend=False,
    autosize=False,
    width=width,
    height=height,
    xaxis=XAxis(axis),
    yaxis=YAxis(axis),
    margin=Margin(
        l=40,
        r=40,
        b=85,
        t=100,
        pad=0,
      
    ),
    hovermode='closest',
    plot_bgcolor='#EFECEA', #set background color           
    )

The last step displays the graph as a figure:

data1=Data([traceE, traceN])
fig = Figure(data=data1, layout=layout)
fig['layout'].update(annotations=make_annotations(position, [str(k) for k in range(len(position))])) 
py.iplot(fig, filename='Tubes')

Running all of the source code above produces the following result:


Because the graph is built with the plotly API, it can also be viewed at



Name of each entity
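
Since the annotations label each node with its index, a lookup from index to entity name can be printed with a few lines. This is only a sketch: entity_legend is a hypothetical helper, and the names are placeholders, not the actual extraction output.

```python
def entity_legend(labels):
    """Return one 'index: name' line per label, matching the node numbers."""
    return ["%d: %s" % (k, name) for k, name in enumerate(labels)]


placeholder_labels = ["Entity A", "Entity B", "Entity C"]  # placeholder names
for line in entity_legend(placeholder_labels):
    print(line)
```

In the script above, the list returned by get_continuous_chunks(berita) would take the place of the placeholder list.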

To download the full source code, click Download source code
